
Showing papers in "Biometrika in 2018"


Journal ArticleDOI
TL;DR: This work shows that, with at least two independent proxy variables satisfying a certain rank condition, the causal effect is nonparametrically identified, even if the measurement error mechanism, i.e., the conditional distribution of the proxies given the confounder, may not be identified.
Abstract: We consider a causal effect that is confounded by an unobserved variable, but with observed proxy variables of the confounder. We show that, with at least two independent proxy variables satisfying a certain rank condition, the causal effect is nonparametrically identified, even if the measurement error mechanism, i.e., the conditional distribution of the proxies given the confounder, may not be identified. Our result generalizes the identification strategy of Kuroki & Pearl (2014) that rests on identification of the measurement error mechanism. When only one proxy for the confounder is available, or the required rank condition is not met, we develop a strategy to test the null hypothesis of no causal effect.

181 citations


Journal ArticleDOI
TL;DR: A diverse range of p-value combination methods appears in the literature, each with different statistical properties, yet the final choice used in a meta-analysis can appear arbitrary, as if all effort had been expended building the models that gave rise to the p-values.
Abstract: Combining p-values from independent statistical tests is a popular approach to meta-analysis, particularly when the data underlying the tests are either no longer available or are difficult to combine. A diverse range of p-value combination methods appear in the literature, each with different statistical properties. Yet all too often the final choice used in a meta-analysis can appear arbitrary, as if all effort has been expended building the models that gave rise to the p-values. Birnbaum (1954) showed that any reasonable p-value combiner must be optimal against some alternative hypothesis. Starting from this perspective and recasting each method of combining p-values as a likelihood ratio test, we present theoretical results for some of the standard combiners which provide guidance about how a powerful combiner might be chosen in practice.

123 citations
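
For readers who want to experiment with the standard combiners this paper analyses, the sketch below implements two textbook examples, Fisher's and Stouffer's methods; the formulas and null distributions are classical results rather than anything specific to the paper, and the example p-values are made up.

```python
import numpy as np
from scipy import stats

def fisher_combine(pvals):
    """Fisher's method: -2 * sum(log p) follows a chi-squared distribution
    with 2k degrees of freedom under the global null."""
    pvals = np.asarray(pvals, dtype=float)
    stat = -2.0 * np.sum(np.log(pvals))
    return stats.chi2.sf(stat, df=2 * len(pvals))

def stouffer_combine(pvals):
    """Stouffer's method: sum of probit-transformed p-values, rescaled to N(0, 1)."""
    pvals = np.asarray(pvals, dtype=float)
    z = stats.norm.isf(pvals)                 # per-study z-scores
    return stats.norm.sf(z.sum() / np.sqrt(len(pvals)))

p = [0.03, 0.20, 0.07, 0.45]                  # illustrative p-values
print(fisher_combine(p), stouffer_combine(p))
```

Recasting each such combiner as a likelihood ratio test against a particular alternative, as the paper does, is what allows one to say when a given combiner will be powerful.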


Journal ArticleDOI
TL;DR: This work proposes a method that attains uniform approximate balance for covariate functions in a reproducing-kernel Hilbert space and shows that it achieves better balance with smaller sampling variability than existing methods.
Abstract: Covariate balance is often advocated for objective causal inference since it mimics randomization in observational data. Unlike methods that balance specific moments of covariates, our proposal attains uniform approximate balance for covariate functions in a reproducing-kernel Hilbert space. The corresponding infinite-dimensional optimization problem is shown to have a finite-dimensional representation in terms of an eigenvalue optimization problem. Large-sample results are studied, and numerical examples show that the proposed method achieves better balance with smaller sampling variability than existing methods.

71 citations
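
To make the idea of uniform balance over an RKHS concrete, the sketch below computes the worst-case imbalance of a weighted treated sample against the full-sample average over the unit ball of a Gaussian-kernel RKHS; by the reproducing property this supremum is the square root of a quadratic form in the Gram matrix. This is only a generic kernel-mean-embedding calculation to illustrate the kind of criterion being controlled, not the paper's eigenvalue-optimization procedure, and the kernel, bandwidth and data are arbitrary choices.

```python
import numpy as np

def gaussian_kernel(X, Y, bandwidth=1.0):
    """Gram matrix of the Gaussian (RBF) kernel between the rows of X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def rkhs_imbalance(X, treated, weights, bandwidth=1.0):
    """Worst-case imbalance sup_{||f||_H <= 1} | sum_i w_i f(x_i) - (1/n) sum_j f(x_j) |.

    By the reproducing property this supremum equals sqrt(a' K a), where a holds the
    treated weights minus the uniform full-sample weights 1/n.
    """
    n = X.shape[0]
    a = -np.ones(n) / n                       # target: the full-sample average
    a[treated] += weights                     # weighted treated sample
    K = gaussian_kernel(X, X, bandwidth)
    return float(np.sqrt(a @ K @ a))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
treated = rng.random(100) < 0.4
w = np.full(treated.sum(), 1.0 / treated.sum())   # uniform treated weights as a baseline
print(rkhs_imbalance(X, treated, w))
```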


Journal ArticleDOI
TL;DR: In this paper, the authors investigate the use of proper scoring rules for high-dimensional peaks-over-threshold inference, focusing on extreme-value processes associated with log-Gaussian random functions, and compare gradient score estimators with the spectral and censored likelihood estimators for regularly varying distributions with normalized marginals, using data with several hundred locations.
Abstract: SummaryMax-stable processes are increasingly widely used for modelling complex extreme events, but existing fitting methods are computationally demanding, limiting applications to a few dozen variables. ${r}$-Pareto processes are mathematically simpler and have the potential advantage of incorporating all relevant extreme events, by generalizing the notion of a univariate exceedance. In this paper we investigate the use of proper scoring rules for high-dimensional peaks-over-threshold inference, focusing on extreme-value processes associated with log-Gaussian random functions, and compare gradient score estimators with the spectral and censored likelihood estimators for regularly varying distributions with normalized marginals, using data with several hundred locations. When simulating from the true model, the spectral estimator performs best, closely followed by the gradient score estimator, but censored likelihood estimation performs better with simulations from the domain of attraction, though it is outperformed by the gradient score in cases of weak extremal dependence. We illustrate the potential and flexibility of our ideas by modelling extreme rainfall on a grid with 3600 locations, based on exceedances for locally intense and for spatially accumulated rainfall, and discuss diagnostics of model fit. The differences between the two fitted models highlight how the definition of rare events affects the estimated dependence structure.

67 citations


Journal ArticleDOI
TL;DR: In this article, robust matrix estimators for a much richer class of distributions are presented, under a bounded fourth moment assumption, achieving the same minimax convergence rates as do existing methods under a sub-Gaussianity assumption.
Abstract: High-dimensional data are often most plausibly generated from distributions with complex structure and leptokurtosis in some or all components. Covariance and precision matrices provide a useful summary of such structure, yet the performance of popular matrix estimators typically hinges upon a sub-Gaussianity assumption. This paper presents robust matrix estimators whose performance is guaranteed for a much richer class of distributions. The proposed estimators, under a bounded fourth moment assumption, achieve the same minimax convergence rates as do existing methods under a sub-Gaussianity assumption. Consistency of the proposed estimators is also established under the weak assumption of bounded $2+\varepsilon$ moments for $\varepsilon \in (0, 2)$. The associated convergence rates depend on $\varepsilon$.

65 citations
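
As a rough illustration of the kind of robustification that works under a bounded fourth-moment assumption, the sketch below computes an element-wise truncated covariance estimator, clipping each cross-product before averaging. This is a generic member of this family of estimators, not the authors' exact construction, and the robust centring and threshold rule are ad hoc heuristics.

```python
import numpy as np

def truncated_covariance(X, tau):
    """Element-wise truncated covariance: for each (j, k), average the cross-products
    after clipping them at +/- tau, which guards against heavy tails."""
    Xc = X - np.median(X, axis=0)              # robust centring (a simple heuristic)
    n, p = Xc.shape
    S = np.empty((p, p))
    for j in range(p):
        for k in range(j, p):
            S[j, k] = S[k, j] = np.mean(np.clip(Xc[:, j] * Xc[:, k], -tau, tau))
    return S

rng = np.random.default_rng(10)
X = rng.standard_t(df=4.5, size=(200, 5))      # heavy-tailed data with finite fourth moments
tau = np.sqrt(200 / np.log(5)) * X.var()       # threshold heuristic growing with n / log p
print(np.round(truncated_covariance(X, tau), 2))
```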


Journal ArticleDOI
TL;DR: In this paper, an extension of the stochastic block model for recurrent interaction events in continuous time is proposed, where every individual belongs to a latent group and conditional interactions between two individuals follow an inhomogeneous Poisson process with intensity driven by the individuals' latent groups.
Abstract: We propose an extension of the stochastic block model for recurrent interaction events in continuous time, where every individual belongs to a latent group and conditional interactions between two individuals follow an inhomogeneous Poisson process with intensity driven by the individuals’ latent groups. We show that the model is identifiable and estimate it with a semiparametric variational expectation-maximization algorithm. We develop two versions of the method, one using a nonparametric histogram approach with an adaptive choice of the partition size, and the other using kernel intensity estimators. We select the number of latent groups by an integrated classification likelihood criterion. We demonstrate the performance of our procedure on synthetic experiments, analyse two datasets to illustrate the utility of our approach, and comment on competing methods.

62 citations
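
A toy generator for the kind of data this model describes is sketched below: each individual gets a latent group, and each pair interacts according to a Poisson process whose rate depends on the two groups. For brevity the intensities are constant in time rather than inhomogeneous, and the group sizes and rate matrix are made-up values.

```python
import numpy as np

rng = np.random.default_rng(1)
n, Q, T = 30, 2, 10.0                       # individuals, latent groups, time horizon
rates = np.array([[2.0, 0.2],               # pairwise interaction rates per unit time,
                  [0.2, 1.0]])              # indexed by the two latent groups (toy values)
z = rng.integers(0, Q, size=n)              # latent group memberships

events = []                                  # (i, j, time) interaction events
for i in range(n):
    for j in range(i + 1, n):
        lam = rates[z[i], z[j]]
        m = rng.poisson(lam * T)             # event count of a constant-rate Poisson process
        events += [(i, j, t) for t in np.sort(rng.uniform(0, T, size=m))]

print(len(events), "events; first few:", events[:3])
```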


Journal ArticleDOI
TL;DR: In this paper, the authors consider the asymptotic behavior of the posterior obtained from approximate Bayesian computation (ABC) and the ensuing posterior mean, giving general results on: (i) the rate of concentration of the ABC posterior on sets containing the true parameter (vector); (ii) the limiting shape of the posterior; and (iii) the asymptotic distribution of the ABC posterior mean.
Abstract: Approximate Bayesian computation (ABC) is becoming an accepted tool for statistical analysis in models with intractable likelihoods. With the initial focus being primarily on the practical import of ABC, exploration of its formal statistical properties has begun to attract more attention. In this paper we consider the asymptotic behavior of the posterior obtained from ABC and the ensuing posterior mean. We give general results on: (i) the rate of concentration of the ABC posterior on sets containing the true parameter (vector); (ii) the limiting shape of the posterior; and (iii) the asymptotic distribution of the ABC posterior mean. These results hold under given rates for the tolerance used within ABC, mild regularity conditions on the summary statistics, and a condition linked to identification of the true parameters. Using simple illustrative examples that have featured in the literature, we demonstrate that the required identification condition is far from guaranteed. The implications of the theoretical results for practitioners of ABC are also highlighted.

60 citations
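
For readers unfamiliar with the basic algorithm whose asymptotics are being studied, the sketch below runs plain rejection ABC on a toy normal-mean problem, keeping prior draws whose simulated summary lies within a tolerance of the observed summary; the model, prior and tolerance are illustrative choices, and the summary's sampling distribution is simulated directly because it is known exactly in this toy case.

```python
import numpy as np

rng = np.random.default_rng(2)
y_obs = rng.normal(loc=1.5, scale=1.0, size=200)        # "observed" data (toy)
s_obs = y_obs.mean()                                    # summary statistic

def rejection_abc(n_draws, tol):
    """Keep prior draws whose simulated summary lies within tol of the observed summary."""
    theta = rng.normal(0.0, 5.0, size=n_draws)          # draws from a vague prior
    # The sample mean of 200 N(theta, 1) observations is exactly N(theta, 1/sqrt(200)),
    # so the summary can be simulated directly in this toy example.
    s_sim = rng.normal(theta, 1.0 / np.sqrt(len(y_obs)))
    return theta[np.abs(s_sim - s_obs) < tol]

post = rejection_abc(100_000, tol=0.05)
print(post.mean(), post.std())                          # ABC posterior mean and spread
```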


Journal ArticleDOI
TL;DR: The method is based on an optimality criterion motivated by the Campbell formula applied to the reciprocal intensity function and is fully nonparametric, does not require knowledge of higher-order moments, and is not restricted to a specific class of point process.
Abstract: We propose a new bandwidth selection method for kernel estimators of spatial point process intensity functions. The method is based on an optimality criterion motivated by the Campbell formula applied to the reciprocal intensity function. The new method is fully nonparametric, does not require knowledge of higher-order moments, and is not restricted to a specific class of point process. Our approach is computationally straightforward and does not require numerical approximation of integrals.

58 citations
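
The sketch below illustrates a criterion of the kind described: by the Campbell formula with f = 1/λ, the sum of reciprocal intensities over the observed points has expectation equal to the window area, so a bandwidth can be chosen to make the plug-in sum match that area. Edge correction is omitted and the Gaussian kernel, search grid and simulated pattern are arbitrary, so this should be read as a rough sketch of the idea rather than the paper's exact selector.

```python
import numpy as np
from scipy.stats import norm

def kernel_intensity(points, h):
    """Gaussian kernel intensity estimate evaluated at the observed points
    (no edge correction, for brevity)."""
    d = points[:, None, :] - points[None, :, :]
    k = norm.pdf(d[..., 0], scale=h) * norm.pdf(d[..., 1], scale=h)
    return k.sum(axis=1)

def campbell_criterion(points, h, area):
    """Campbell's formula with f = 1/lambda gives E[sum_i 1/lambda(x_i)] = |W|,
    so the bandwidth is tuned to make the plug-in sum match the window area."""
    return abs(np.sum(1.0 / kernel_intensity(points, h)) - area)

rng = np.random.default_rng(3)
pts = rng.uniform(0, 1, size=(rng.poisson(200), 2))     # toy Poisson pattern on the unit square
grid = np.linspace(0.02, 0.3, 30)
best = grid[np.argmin([campbell_criterion(pts, h, area=1.0) for h in grid])]
print("selected bandwidth:", best)
```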


Journal ArticleDOI
TL;DR: This paper proposes a smooth weighting that approximates sample trimming and is asymptotically linear, so that bootstrap inference can incorporate uncertainty arising from both the design and analysis stages.
Abstract: Causal inference with observational studies often relies on the assumptions of unconfoundedness and overlap of covariate distributions in different treatment groups. The overlap assumption is violated when some units have propensity scores close to 0 or 1, so both practical and theoretical researchers suggest dropping units with extreme estimated propensity scores. However, existing trimming methods often do not incorporate the uncertainty in this design stage and restrict inference to only the trimmed sample, due to the nonsmoothness of the trimming. We propose a smooth weighting, which approximates sample trimming and has better asymptotic properties. An advantage of our estimator is its asymptotic linearity, which ensures that the bootstrap can be used to make inference for the target population, incorporating uncertainty arising from both design and analysis stages. We extend the theory to the average treatment effect on the treated, suggesting trimming samples with estimated propensity scores close to 1.

58 citations
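
The sketch below contrasts hard trimming of extreme estimated propensity scores with a smooth surrogate for the trimming indicator; the particular smoother (a product of scaled normal distribution functions) and the tuning constants are illustrative stand-ins, not the authors' specific weight function.

```python
import numpy as np
from scipy.stats import norm

def hard_trim(e, alpha=0.1):
    """Indicator weights that drop units with extreme estimated propensity scores."""
    return ((e >= alpha) & (e <= 1 - alpha)).astype(float)

def smooth_trim(e, alpha=0.1, eps=0.02):
    """Smooth surrogate for the trimming indicator: a product of two normal
    distribution functions rising near alpha and falling near 1 - alpha
    (an illustrative choice of smoother, not the paper's)."""
    return norm.cdf((e - alpha) / eps) * norm.cdf((1 - alpha - e) / eps)

e_hat = np.linspace(0.01, 0.99, 9)          # hypothetical estimated propensity scores
print(np.round(hard_trim(e_hat), 2))
print(np.round(smooth_trim(e_hat), 2))
```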


Journal ArticleDOI
TL;DR: Symmetric rank covariances form a new class of multivariate nonparametric measures of dependence that generalizes Kendall's tau, Hoeffding's D and the Bergsma–Dassios sign covariance, and leads naturally to multivariate extensions of the latter.
Abstract: SummaryThe need to test whether two random vectors are independent has spawned many competing measures of dependence. We focus on nonparametric measures that are invariant under strictly increasing transformations, such as Kendall’s tau, Hoeffding’s $D$, and the Bergsma–Dassios sign covariance. Each exhibits symmetries that are not readily apparent from their definitions. Making these symmetries explicit, we define a new class of multivariate nonparametric measures of dependence that we call symmetric rank covariances. This new class generalizes the above measures and leads naturally to multivariate extensions of the Bergsma–Dassios sign covariance. Symmetric rank covariances may be estimated unbiasedly using U-statistics, for which we prove results on computational efficiency and large-sample behaviour. The algorithms we develop for their computation include, to the best of our knowledge, the first efficient algorithms for Hoeffding’s $D$ statistic in the multivariate setting.

57 citations


Journal ArticleDOI
TL;DR: A new class of Bayesian nonparametric models are proposed and an efficient posterior computational algorithm is developed that provides large prior support over the class of piecewise‐smooth, sparse, and continuous spatially varying regression coefficient functions.
Abstract: This work concerns spatial variable selection for scalar-on-image regression. We propose a new class of Bayesian nonparametric models and develop an efficient posterior computational algorithm. The proposed soft-thresholded Gaussian process provides large prior support over the class of piecewise-smooth, sparse, and continuous spatially varying regression coefficient functions. In addition, under some mild regularity conditions the soft-thresholded Gaussian process prior leads to posterior consistency for parameter estimation and variable selection for scalar-on-image regression, even when the number of predictors is larger than the sample size. The proposed method is compared to alternatives via simulation and applied to an electroencephalography study of alcoholism.
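
The soft-thresholding operation at the heart of the prior is easy to visualize: draw a Gaussian process on a grid and set to zero the parts of the path whose magnitude falls below a threshold, shrinking the rest. The sketch below does this on a one-dimensional grid with a squared-exponential covariance; the kernel, grid and threshold are toy choices.

```python
import numpy as np

def soft_threshold(g, lam):
    """Soft-thresholding: set |g| <= lam exactly to zero and shrink the rest."""
    return np.sign(g) * np.maximum(np.abs(g) - lam, 0.0)

rng = np.random.default_rng(4)
s = np.linspace(0, 1, 200)                               # one-dimensional "image" grid
K = np.exp(-(s[:, None] - s[None, :]) ** 2 / 0.02)       # squared-exponential covariance
g = rng.multivariate_normal(np.zeros(s.size), K + 1e-8 * np.eye(s.size))
beta = soft_threshold(g, lam=0.5)                        # sparse yet continuous coefficient path

print("fraction of the grid where beta is exactly zero:", np.mean(beta == 0.0))
```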

Journal ArticleDOI
TL;DR: This work develops a class of models that posit a correlation structure among the outcomes of a randomized experiment, and leverages these models to develop restricted randomization strategies for allocating treatment optimally, by minimizing the mean square error of the estimated average treatment effect.
Abstract: SummaryIn this paper we consider how to assign treatment in a randomized experiment in which the correlation among the outcomes is informed by a network available pre-intervention. Working within the potential outcome causal framework, we develop a class of models that posit such a correlation structure among the outcomes. We use these models to develop restricted randomization strategies for allocating treatment optimally, by minimizing the mean squared error of the estimated average treatment effect. Analytical decompositions of the mean squared error, due both to the model and to the randomization distribution, provide insights into aspects of the optimal designs. In particular, the analysis suggests new notions of balance based on specific network quantities, in addition to classical covariate balance. The resulting balanced optimal restricted randomization strategies are still design-unbiased when the model used to derive them does not hold. We illustrate how the proposed treatment allocation strategies improve on allocations that ignore the network structure.

Journal ArticleDOI
Sean Yiu, Li Su
TL;DR: A unified framework for constructing weights such that a set of measured pretreatment covariates is unassociated with treatment assignment after weighting is proposed, which extends to longitudinal settings.
Abstract: Weighting methods offer an approach to estimating causal treatment effects in observational studies. However, if weights are estimated by maximum likelihood, misspecification of the treatment assignment model can lead to weighted estimators with substantial bias and variance. In this paper, we propose a unified framework for constructing weights such that a set of measured pretreatment covariates is unassociated with treatment assignment after weighting. We derive conditions for weight estimation by eliminating the associations between these covariates and treatment assignment characterized in a chosen treatment assignment model after weighting. The moment conditions in covariate balancing weight methods for binary, categorical and continuous treatments in cross-sectional settings are special cases of the conditions in our framework, which extends to longitudinal settings. Simulation shows that our method gives treatment effect estimates with smaller biases and variances than the maximum likelihood approach under treatment assignment model misspecification. We illustrate our method with an application to systemic lupus erythematosus data.

Journal ArticleDOI
TL;DR: In this article, the authors proposed a method to estimate variances of a number of Monte Carlo approximations that particle filters deliver, by keeping track of certain key features of the genealogical structure arising from resampling operations.
Abstract: SummaryThis paper concerns numerical assessment of Monte Carlo error in particle filters. We show that by keeping track of certain key features of the genealogical structure arising from resampling operations, it is possible to estimate variances of a number of Monte Carlo approximations that particle filters deliver. All our estimators can be computed from a single run of a particle filter. We establish that, as the number of particles grows, our estimators are weakly consistent for asymptotic variances of the Monte Carlo approximations and some of them are also non-asymptotically unbiased. The asymptotic variances can be decomposed into terms corresponding to each time step of the algorithm, and we show how to estimate each of these terms consistently. When the number of particles may vary over time, this allows approximation of the asymptotically optimal allocation of particle numbers.

Journal ArticleDOI
TL;DR: The authors develop an estimation procedure for the optimal dynamic treatment regime over an indefinite time period, derive associated large-sample results, and illustrate the method in a chronic disease setting by simulating a dataset corresponding to a cohort of patients with diabetes.
Abstract: Summary Existing methods for estimating optimal dynamic treatment regimes are limited to cases where a utility function is optimized over a fixed time period. We develop an estimation procedure for the optimal dynamic treatment regime over an indefinite time period and derive associated large-sample results. The proposed method can be used to estimate the optimal dynamic treatment regime in chronic disease settings. We illustrate this by simulating a dataset corresponding to a cohort of patients with diabetes that mimics the third wave of the National Health and Nutrition Examination Survey, and examining the performance of the proposed method in controlling the level of haemoglobin A1c.

Journal ArticleDOI
TL;DR: In this paper, the overselection property of the Akaike information criterion is exploited to construct a selection region and to obtain the asymptotic distribution of estimators and linear combinations thereof conditional on the selected model.
Abstract: Summary Ignoring the model selection step in inference after selection is harmful. In this paper we study the asymptotic distribution of estimators after model selection using the Akaike information criterion. First, we consider the classical setting in which a true model exists and is included in the candidate set of models. We exploit the overselection property of this criterion in constructing a selection region, and we obtain the asymptotic distribution of estimators and linear combinations thereof conditional on the selected model. The limiting distribution depends on the set of competitive models and on the smallest overparameterized model. Second, we relax the assumption on the existence of a true model and obtain uniform asymptotic results. We use simulation to study the resulting post-selection distributions and to calculate confidence regions for the model parameters, and we also apply the method to a diabetes dataset.

Journal ArticleDOI
TL;DR: In this paper, results are presented on the asymptotic variance of estimators obtained using approximate Bayesian computation in a large-data limit, under the key assumption that the data are summarized by a fixed-dimensional summary statistic that obeys a central limit theorem.
Abstract: Many statistical applications involve models for which it is difficult to evaluate the likelihood, but from which it is relatively easy to sample. Approximate Bayesian computation is a likelihood-free method for implementing Bayesian inference in such cases. We present results on the asymptotic variance of estimators obtained using approximate Bayesian computation in a large-data limit. Our key assumption is that the data is summarized by a fixed-dimensional summary statistic that obeys a central limit theorem. We prove asymptotic normality of the mean of the approximate Bayesian computation posterior. This result also shows that, in terms of asymptotic variance, we should use a summary statistic that is the same dimension as the parameter vector, p; and that any summary statistic of higher dimension can be reduced, through a linear transformation, to dimension p in a way that can only reduce the asymptotic variance of the posterior mean. We look at how the Monte Carlo error of an importance sampling algorithm that samples from the approximate Bayesian computation posterior affects the accuracy of estimators. We give conditions on the importance sampling proposal distribution such that the variance of the estimator will be the same order as that of the maximum likelihood estimator based on the summary statistics used. This suggests an iterative importance sampling algorithm, which we evaluate empirically on a stochastic volatility model.

Journal ArticleDOI
TL;DR: Asymptotic results for the regression‐adjusted version of approximate Bayesian computation introduced by Beaumont et al. (2002) are presented and it is shown that for an appropriate choice of the bandwidth, regression adjustment will lead to a posterior that, asymptotically, correctly quantifies uncertainty.
Abstract: We present asymptotic results for the regression-adjusted version of approximate Bayesian computation introduced by Beaumont et al. (2002). We show that for an appropriate choice of the bandwidth, regression adjustment will lead to a posterior that, asymptotically, correctly quantifies uncertainty. Furthermore, for such a choice of bandwidth we can implement an importance sampling algorithm to sample from the posterior whose acceptance probability tends to unity as the data sample size increases. This compares favourably to results for standard approximate Bayesian computation, where the only way to obtain a posterior that correctly quantifies uncertainty is to choose a much smaller bandwidth; one for which the acceptance probability tends to zero and hence for which Monte Carlo error will dominate.
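
A minimal sketch of the regression adjustment of Beaumont et al. (2002) in its simplest linear form is given below: accepted parameter draws are corrected by a least-squares fit of the parameter on the simulated summaries, shifted to the observed summary. The kernel weighting of the original method is dropped for brevity, and the toy model, tolerance and prior are made-up choices.

```python
import numpy as np

def regression_adjust(theta, s_sim, s_obs):
    """Adjust accepted draws with a linear fit of theta on the summaries:
    theta_adj = theta - b * (s_sim - s_obs)."""
    S = np.column_stack([np.ones(len(s_sim)), s_sim - s_obs])
    coef, *_ = np.linalg.lstsq(S, theta, rcond=None)
    return theta - coef[1] * (s_sim - s_obs)

rng = np.random.default_rng(5)
theta = rng.normal(0.0, 5.0, size=5000)        # prior draws in a toy normal-mean model
s_sim = rng.normal(theta, 0.2)                 # simulated summaries
s_obs = 1.0
keep = np.abs(s_sim - s_obs) < 0.5             # deliberately loose tolerance
adj = regression_adjust(theta[keep], s_sim[keep], s_obs)
print(theta[keep].std(), adj.std())            # adjustment sharpens the accepted draws
```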

Journal ArticleDOI
TL;DR: In this paper, the authors proposed using the scaled lasso to perform inference for selected coefficients and the noise level simultaneously, and provided valid $p$-values and confidence intervals for coefficients after variable selection and estimates for the model-specific variance.
Abstract: Summary There has been much recent work on inference after model selection in situations where the noise level is known. However, the error variance is rarely known in practice and its estimation is difficult in high-dimensional settings. In this work we propose using the square-root lasso, also known as the scaled lasso, to perform inference for selected coefficients and the noise level simultaneously. The square-root lasso has the property that the choice of a reasonable tuning parameter does not depend on the noise level in the data. We provide valid $p$-values and confidence intervals for coefficients after variable selection and estimates for the model-specific variance. Our estimators perform better in simulations than other estimators of the noise variance. These results make inference after model selection significantly more applicable.
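
The square-root lasso objective mentioned here is a convex program, sketched below assuming the cvxpy package is available; the simulated design, sparse coefficient vector and tuning constant are illustrative choices, with the tuning constant picked on the usual square-root-of-(log p)/n scale precisely because it need not depend on the noise level.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(6)
n, p = 100, 200
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + rng.normal(scale=1.0, size=n)

# Square-root lasso: ||y - Xb||_2 / sqrt(n) + lam * ||b||_1; the tuning constant can be
# chosen on the sqrt(log p / n) scale without knowing the noise level.
lam = 1.1 * np.sqrt(2.0 * np.log(p) / n)
b = cp.Variable(p)
cp.Problem(cp.Minimize(cp.norm(y - X @ b, 2) / np.sqrt(n) + lam * cp.norm1(b))).solve()

resid = y - X @ b.value
print("estimated noise level:", np.linalg.norm(resid) / np.sqrt(n))
```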

Journal ArticleDOI
TL;DR: In this paper, a testable hypothesis of compositional equivalence for the means of two latent log basis vectors is formulated and a test through the centred log-ratio transformation of the compositions is proposed.
Abstract: Summary Compositional data are ubiquitous in many scientific endeavours. Motivated by microbiome and metagenomic research, we consider a two-sample testing problem for high-dimensional compositional data and formulate a testable hypothesis of compositional equivalence for the means of two latent log basis vectors. We propose a test through the centred log-ratio transformation of the compositions. The asymptotic null distribution of the test statistic is derived and its power against sparse alternatives is investigated. A modified test for paired samples is also considered. Simulations show that the proposed tests can be significantly more powerful than tests that are applied to the raw and log-transformed compositions. The usefulness of our tests is illustrated by applications to gut microbiome composition in obesity and Crohn’s disease.
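
The centred log-ratio transformation and a naive entry-wise two-sample comparison on the transformed scale are sketched below; the max-type calibration against sparse alternatives developed in the paper is not reproduced, and the Dirichlet-generated compositions are toy data.

```python
import numpy as np

def clr(X):
    """Centred log-ratio transform: log of each composition minus its mean log."""
    L = np.log(X)
    return L - L.mean(axis=1, keepdims=True)

rng = np.random.default_rng(7)
A = rng.dirichlet(np.ones(50), size=40)       # two groups of 50-part compositions (toy data)
B = rng.dirichlet(np.ones(50), size=40)

Za, Zb = clr(A), clr(B)
se = np.sqrt(Za.var(axis=0, ddof=1) / len(Za) + Zb.var(axis=0, ddof=1) / len(Zb))
t = (Za.mean(axis=0) - Zb.mean(axis=0)) / se
print("max |t| over components:", np.abs(t).max())   # a max-type statistic on the clr scale
```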

Journal ArticleDOI
TL;DR: This work highlights the weaknesses of widely-used penalized M-estimators, proposes a robust penalized quasilikelihood estimator, and shows that it enjoys oracle properties in high dimensions and is stable in a neighbourhood of the model.
Abstract: Generalized linear models are popular for modelling a large variety of data. We consider variable selection through penalized methods by focusing on resistance issues in the presence of outlying data and other deviations from assumptions. We highlight the weaknesses of widely-used penalized M-estimators, propose a robust penalized quasilikelihood estimator, and show that it enjoys oracle properties in high dimensions and is stable in a neighbourhood of the model. We illustrate its finite-sample performance on simulated and real data.

Journal ArticleDOI
TL;DR: This paper proposes a convex formulation for fitting sparse sliced inverse regression in high dimensions that estimates the subspace of the linear combinations of the covariates directly and performs variable selection simultaneously, and establishes an upper bound on the subspace distance between the estimated and the true subspaces.
Abstract: Summary Sliced inverse regression is a popular tool for sufficient dimension reduction, which replaces covariates with a minimal set of their linear combinations without loss of information on the conditional distribution of the response given the covariates. The estimated linear combinations include all covariates, making results difficult to interpret and perhaps unnecessarily variable, particularly when the number of covariates is large. In this paper, we propose a convex formulation for fitting sparse sliced inverse regression in high dimensions. Our proposal estimates the subspace of the linear combinations of the covariates directly and performs variable selection simultaneously. We solve the resulting convex optimization problem via the linearized alternating direction method of multipliers algorithm, and establish an upper bound on the subspace distance between the estimated and the true subspaces. Through numerical studies, we show that our proposal is able to identify the correct covariates in the high-dimensional setting.
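
For context, the sketch below implements classical (non-sparse) sliced inverse regression: standardize the covariates, average them within slices of the ordered response, and take leading eigenvectors of the weighted second-moment matrix of the slice means. The paper's convex sparse formulation and its ADMM solver are not reproduced here, and the simulated single-index example is a made-up illustration.

```python
import numpy as np

def sir_directions(X, y, n_slices=10, n_dirs=1):
    """Classical sliced inverse regression: leading eigenvectors of the weighted
    second-moment matrix of slice-wise means of the standardized covariates,
    mapped back to the original covariate scale."""
    n = X.shape[0]
    mu = X.mean(axis=0)
    L = np.linalg.cholesky(np.cov(X, rowvar=False))
    Z = np.linalg.solve(L, (X - mu).T).T                  # standardized covariates
    slices = np.array_split(np.argsort(y), n_slices)      # slice the ordered response
    M = sum((len(idx) / n) * np.outer(Z[idx].mean(axis=0), Z[idx].mean(axis=0))
            for idx in slices)
    vals, vecs = np.linalg.eigh(M)
    top = vecs[:, np.argsort(vals)[::-1][:n_dirs]]
    return np.linalg.solve(L.T, top)                      # back to the original scale

rng = np.random.default_rng(8)
X = rng.normal(size=(500, 6))
y = (X[:, 0] + 2.0 * X[:, 1]) ** 3 + 0.1 * rng.normal(size=500)
print(sir_directions(X, y).ravel())   # roughly proportional to (1, 2, 0, 0, 0, 0)
```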

Journal ArticleDOI
TL;DR: In this article, the authors use regression adjustment to correct for persistent covariate imbalances after randomization, and present two regression-assisted estimators for the sample average treatment effect in paired experiments.
Abstract: Summary In paired randomized experiments, individuals in a given matched pair may differ on prognostically important covariates despite the best efforts of practitioners. We examine the use of regression adjustment to correct for persistent covariate imbalances after randomization, and present two regression-assisted estimators for the sample average treatment effect in paired experiments. Using the potential outcomes framework, we prove that these estimators are consistent for the sample average treatment effect under mild regularity conditions even if the regression model is improperly specified, and describe how asymptotically conservative confidence intervals can be constructed. We demonstrate that the variances of the regression-assisted estimators are no larger than that of the standard difference-in-means estimator asymptotically, and illustrate the proposed methods by simulation. The analysis does not require a superpopulation model, a constant treatment effect, or the truth of the regression model, and hence provides inference for the sample average treatment effect with the potential to increase power without unrealistic assumptions.

Journal ArticleDOI
L. Forastiere, A. Mattei, P. Ding
TL;DR: Using principal stratification, the authors show how sequential ignorability extrapolates from observable potential outcomes to a priori counterfactuals, and propose alternative, weaker principal ignorability-type assumptions.
Abstract: In causal mediation analysis, the definitions of the natural direct and indirect effects involve potential outcomes that can never be observed, so-called a priori counterfactuals. This conceptual challenge translates into issues in identification, which requires strong and often unverifiable assumptions, including sequential ignorability. Alternatively, we can deal with post-treatment variables using the principal stratification framework, where causal effects are defined as comparisons of observable potential outcomes. We establish a novel bridge between mediation analysis and principal stratification, which helps to clarify and weaken the commonly used identifying assumptions for natural direct and indirect effects. Using principal stratification, we show how sequential ignorability extrapolates from observable potential outcomes to a priori counterfactuals, and propose alternative weaker principal ignorability-type assumptions. We illustrate the key concepts using a clinical trial.

Journal ArticleDOI
TL;DR: The matrix multivariate auto-distance covariance and correlation functions for time series are introduced, their interpretation is discussed, consistent estimators are developed for practical implementation, and a test of the independent and identically distributed hypothesis is proposed.
Abstract: We introduce the matrix multivariate auto-distance covariance and correlation functions for time series, discuss their interpretation and develop consistent estimators for practical implementation. We also develop a test of the independent and identically distributed hypothesis for multivariate time series data and show that it performs better than the multivariate Ljung–Box test. We discuss computational aspects and present a data example to illustrate the method.
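
The sketch below computes the usual squared sample distance covariance via doubly centred distance matrices and applies it to a series and its lagged copy, which is the basic ingredient behind auto-distance covariance; the matrix-valued construction and the test calibration in the paper are not reproduced, and the example series is a toy case that is uncorrelated but dependent at lag one.

```python
import numpy as np

def dcov2(X, Y):
    """Squared sample distance covariance: mean entrywise product of the doubly
    centred pairwise Euclidean distance matrices of X and Y."""
    def centred(Z):
        D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
        return D - D.mean(axis=0) - D.mean(axis=1)[:, None] + D.mean()
    return (centred(X) * centred(Y)).mean()

def auto_dcov2(x, lag):
    """Auto-distance covariance of a series at a given lag: distance covariance
    between the series and its lagged copy."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    return dcov2(x[:-lag], x[lag:])

rng = np.random.default_rng(9)
e = rng.normal(size=300)
x = e[1:] * e[:-1]                 # uncorrelated over time, but dependent at lag one
print(auto_dcov2(x, 1), auto_dcov2(rng.normal(size=299), 1))
```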

Journal ArticleDOI
TL;DR: In this paper, a wild residual bootstrap procedure for unpenalized quantile regression is shown to be asymptotically valid for approximating the distribution of a penalized quantile regression estimator with an adaptive $L_1$ penalty.
Abstract: Summary We consider a heteroscedastic regression model in which some of the regression coefficients are zero but it is not known which ones. Penalized quantile regression is a useful approach for analysing such data. By allowing different covariates to be relevant for modelling conditional quantile functions at different quantile levels, it provides a more complete picture of the conditional distribution of a response variable than mean regression. Existing work on penalized quantile regression has been mostly focused on point estimation. Although bootstrap procedures have recently been shown to be effective for inference for penalized mean regression, they are not directly applicable to penalized quantile regression with heteroscedastic errors. We prove that a wild residual bootstrap procedure for unpenalized quantile regression is asymptotically valid for approximating the distribution of a penalized quantile regression estimator with an adaptive $L_1$ penalty and that a modified version can be used to approximate the distribution of an $L_1$-penalized quantile regression estimator. The new methods do not require estimation of the unknown error density function. We establish consistency, demonstrate finite-sample performance, and illustrate the applications on a real data example.

Journal ArticleDOI
TL;DR: This article constructs confidence intervals that have a constant frequentist coverage rate and that make use of information about across-group heterogeneity, resulting in constant-coverage intervals that are narrower than standard $t$-intervals on average across groups.
Abstract: Summary Commonly used interval procedures for multigroup data attain their nominal coverage rates across a population of groups on average, but their actual coverage rate for a given group will be above or below the nominal rate, depending on the group mean. While correct coverage for a given group can be achieved with a standard $t$-interval, this approach is not adaptive to the available information about the distribution of group-specific means. In this article we construct confidence intervals that have a constant frequentist coverage rate and that make use of information about across-group heterogeneity, resulting in constant-coverage intervals that are narrower than standard $t$-intervals on average across groups. Such intervals are constructed by inverting biased Bayes-optimal tests for the mean of each group, where the prior distribution for a given group is estimated with data from the other groups.

Journal ArticleDOI
TL;DR: Replicability analysis aims to identify signals that overlap across independent studies examining the same features; the proposed hypothesis testing procedures first select the promising features from each of two studies separately and then test for replicability only the features selected in both.
Abstract: Replicability analysis aims to identify the findings that are replicated across independent studies examining the same features. We provide powerful novel replicability analysis procedures for two studies, with familywise error rate and false discovery rate control of the replicability claims. The suggested procedures first select the promising features from each study solely based on that study, and then test for replicability only the features that were selected in both studies. We incorporate plug-in estimates of the fraction of null hypotheses in one study among the hypotheses selected by the other study. Since the fraction of nulls in one study among the features selected from the other study is typically small, the power gain can be remarkable. We provide theoretical guarantees for control of the appropriate error rates, as well as simulations that demonstrate the excellent power properties of the suggested procedures. We demonstrate the usefulness of our procedures on real data examples from two application fields: behavioural genetics and microarray studies.

Journal ArticleDOI
TL;DR: The authors propose a Bayesian approach which uses the sampling distribution of a summary statistic to derive the posterior distribution of the parameters of interest, and which is directly applicable to combining information from two independent surveys and to calibration estimation in survey sampling.
Abstract: Summary Statistical inference with complex survey data is challenging because the sampling design can be informative, and ignoring it can produce misleading results. Current methods of Bayesian inference under complex sampling assume that the sampling design is noninformative for the specified model. In this paper, we propose a Bayesian approach which uses the sampling distribution of a summary statistic to derive the posterior distribution of the parameters of interest. Asymptotic properties of the method are investigated. It is directly applicable to combining information from two independent surveys and to calibration estimation in survey sampling. A simulation study confirms that it can provide valid estimation under informative sampling. We apply it to a measurement error problem using data from the Korean Longitudinal Study of Aging.

Journal ArticleDOI
TL;DR: In this article, the authors provide an explanation of overfitting for a class of model selection criteria in a regression context, in terms of the dependence of the selected submodel's dependence on the data.
Abstract: Summary In a regression context, when the relevant subset of explanatory variables is uncertain, it is common to use a data-driven model selection procedure. Classical linear model theory, applied naively to the selected submodel, may not be valid because it ignores the selected submodel’s dependence on the data. We provide an explanation of this phenomenon, in terms of overfitting, for a class of model selection criteria.