
Showing papers on "Sampling distribution published in 2017"


Posted Content
01 Jan 2017
TL;DR: This article developed a non-parametric causal forest for estimating heterogeneous treatment effects that extends Breiman's widely used random forest algorithm, and showed that causal forests are pointwise consistent for the true treatment effect and have an asymptotically Gaussian and centered sampling distribution.
Abstract: Many scientific and engineering challenges--ranging from personalized medicine to customized marketing recommendations--require an understanding of treatment effect heterogeneity. In this paper, we develop a non-parametric causal forest for estimating heterogeneous treatment effects that extends Breiman's widely used random forest algorithm. In the potential outcomes framework with unconfoundedness, we show that causal forests are pointwise consistent for the true treatment effect, and have an asymptotically Gaussian and centered sampling distribution. We also discuss a practical method for constructing asymptotic confidence intervals for the true treatment effect that are centered at the causal forest estimates. Our theoretical results rely on a generic Gaussian theory for a large family of random forest algorithms. To our knowledge, this is the first set of results that allows any type of random forest, including classification and regression forests, to be used for provably valid statistical inference. In experiments, we find causal forests to be substantially more powerful than classical methods based on nearest-neighbor matching, especially in the presence of irrelevant covariates.
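The paper's confidence intervals are centered at the forest estimates and rely on the asymptotically Gaussian sampling distribution. A minimal sketch of that final step, assuming pointwise estimates `tau_hat` and variance estimates `sigma2_hat` have already been produced by some causal forest implementation (both arrays below are hypothetical placeholders, not output of the authors' code):

```python
import numpy as np
from scipy.stats import norm

def pointwise_cis(tau_hat, sigma2_hat, level=0.95):
    """Asymptotic-Gaussian confidence intervals centered at the causal
    forest estimates (hypothetical inputs: per-point treatment effect
    estimates and their estimated sampling variances)."""
    z = norm.ppf(0.5 + level / 2)          # e.g. 1.96 for 95%
    half_width = z * np.sqrt(sigma2_hat)
    return tau_hat - half_width, tau_hat + half_width

# toy usage with made-up numbers
tau_hat = np.array([0.8, 1.2, 0.3])
sigma2_hat = np.array([0.04, 0.09, 0.01])
lo, hi = pointwise_cis(tau_hat, sigma2_hat)
print(np.c_[lo, hi])
```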

485 citations


Posted Content
TL;DR: In this paper, a sampling distribution is learned from demonstrations and then used to bias sampling toward relevant regions of the state space, with a conditional variational autoencoder allowing sample generation from the latent space conditioned on the specific planning problem.
Abstract: A defining feature of sampling-based motion planning is the reliance on an implicit representation of the state space, which is enabled by a set of probing samples. Traditionally, these samples are drawn either probabilistically or deterministically to uniformly cover the state space. Yet, the motion of many robotic systems is often restricted to "small" regions of the state space, due to, for example, differential constraints or collision-avoidance constraints. To accelerate the planning process, it is thus desirable to devise non-uniform sampling strategies that favor sampling in those regions where an optimal solution might lie. This paper proposes a methodology for non-uniform sampling, whereby a sampling distribution is learned from demonstrations, and then used to bias sampling. The sampling distribution is computed through a conditional variational autoencoder, allowing sample generation from the latent space conditioned on the specific planning problem. This methodology is general, can be used in combination with any sampling-based planner, and can effectively exploit the underlying structure of a planning problem while maintaining the theoretical guarantees of sampling-based approaches. Specifically, on several planning problems, the proposed methodology is shown to effectively learn representations for the relevant regions of the state space, resulting in an order of magnitude improvement in terms of success rate and convergence to the optimal cost.
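A common way to keep the theoretical guarantees of a sampling-based planner while biasing samples toward a learned distribution is to mix learned and uniform samples. The sketch below assumes a `learned_sampler` (standing in for the decoder of a trained CVAE, which is not shown) and mixes it with uniform sampling over the state-space bounds; the mixing ratio `lam` and the sampler interface are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

rng = np.random.default_rng(0)

def uniform_sampler(n, lo, hi):
    """Uniform samples over an axis-aligned state-space box."""
    return rng.uniform(lo, hi, size=(n, len(lo)))

def biased_batch(n, lo, hi, learned_sampler, lam=0.5):
    """Draw a batch of candidate states: a fraction `lam` from the learned
    distribution, the rest uniformly. The uniform share keeps coverage of
    the whole space, which preserves the planner's guarantees."""
    n_learned = rng.binomial(n, lam)
    learned = learned_sampler(n_learned)
    uniform = uniform_sampler(n - n_learned, lo, hi)
    return np.vstack([learned, uniform])

# illustrative stand-in for a CVAE decoder conditioned on one problem
def learned_sampler(n, dim=2):
    return rng.normal(loc=[0.7, 0.2], scale=0.05, size=(n, dim))

samples = biased_batch(100, lo=np.zeros(2), hi=np.ones(2),
                       learned_sampler=learned_sampler, lam=0.5)
print(samples.shape)
```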

114 citations


Journal ArticleDOI
TL;DR: The authors used central limit theorems for randomization-based causal analyses of experimental data, where the parameters of interest are functions of a finite population and randomness comes solely from the treatment assignment.
Abstract: Frequentists’ inference often delivers point estimators associated with confidence intervals or sets for parameters of interest. Constructing the confidence intervals or sets requires understanding the sampling distributions of the point estimators, which, in many but not all cases, are related to asymptotic Normal distributions ensured by central limit theorems. Although previous literature has established various forms of central limit theorems for statistical inference in super population models, we still need general and convenient forms of central limit theorems for some randomization-based causal analyses of experimental data, where the parameters of interest are functions of a finite population and randomness comes solely from the treatment assignment. We use central limit theorems for sample surveys and rank statistics to establish general forms of the finite population central limit theorems that are particularly useful for proving asymptotic distributions of randomization tests under th...
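A hedged sketch of the setting these theorems target: the outcomes of a finite population are fixed, the only randomness is the treatment assignment, and the randomization distribution of the difference in means can be compared against its normal approximation. The data below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)

# fixed finite population of outcomes under the sharp null
# (treatment has no effect, so observed outcomes are fixed numbers)
y = rng.normal(size=200)
n_treat = 100

def diff_in_means(y, treat_idx):
    mask = np.zeros(len(y), dtype=bool)
    mask[treat_idx] = True
    return y[mask].mean() - y[~mask].mean()

# randomization distribution: randomness comes solely from assignment
draws = np.array([
    diff_in_means(y, rng.choice(len(y), n_treat, replace=False))
    for _ in range(5000)
])

# compare with the normal approximation suggested by a finite
# population central limit theorem (mean ~ 0, spread from the draws)
print("randomization sd:", draws.std())
print("P(|T| > 0.2), empirical:", np.mean(np.abs(draws) > 0.2))
```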

112 citations


Journal ArticleDOI
TL;DR: This algorithm is based on the provably efficient hit-and-run random walk and crucially uses a preprocessing step to round the anisotropic flux set, enabling reliable and tractable sampling of genome-scale biochemical networks.
Abstract: In constraint-based metabolic modelling, physical and biochemical constraints define a polyhedral convex set of feasible flux vectors. Uniform sampling of this set provides an unbiased characterization of the metabolic capabilities of a biochemical network. However, reliable uniform sampling of genome-scale biochemical networks is challenging due to their high dimensionality and inherent anisotropy. Here, we present an implementation of a new sampling algorithm, coordinate hit-and-run with rounding (CHRR). This algorithm is based on the provably efficient hit-and-run random walk and crucially uses a preprocessing step to round the anisotropic flux set. CHRR provably converges to a uniform stationary sampling distribution. We apply it to metabolic networks of increasing dimensionality. We show that it converges several times faster than a popular artificial centering hit-and-run algorithm, enabling reliable and tractable sampling of genome-scale biochemical networks.
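This is not the authors' CHRR implementation, but a minimal coordinate hit-and-run step for a bounded polytope {x : Ax <= b}, the random walk the algorithm builds on; the rounding preprocessing is omitted, and the toy constraint set is a unit box rather than a metabolic flux space.

```python
import numpy as np

rng = np.random.default_rng(2)

def coordinate_hit_and_run(A, b, x0, n_steps=1000):
    """Coordinate hit-and-run inside a bounded set {x : A x <= b}; x0 must be
    strictly feasible. Each step picks a coordinate direction, computes the
    feasible segment along it, and samples uniformly on that segment."""
    x = x0.astype(float).copy()
    samples = []
    for _ in range(n_steps):
        i = rng.integers(len(x))
        slack = b - A @ x                  # >= 0 at a feasible point
        a_i = A[:, i]
        # the step t must satisfy a_i * t <= slack for every constraint
        lo, hi = -np.inf, np.inf
        pos, neg = a_i > 0, a_i < 0
        if pos.any():
            hi = np.min(slack[pos] / a_i[pos])
        if neg.any():
            lo = np.max(slack[neg] / a_i[neg])
        x[i] += rng.uniform(lo, hi)
        samples.append(x.copy())
    return np.array(samples)

# toy flux-like set: the unit box in 3D, written as A x <= b
A = np.vstack([np.eye(3), -np.eye(3)])
b = np.concatenate([np.ones(3), np.zeros(3)])
chain = coordinate_hit_and_run(A, b, x0=np.full(3, 0.5))
print(chain.mean(axis=0))   # should approach [0.5, 0.5, 0.5]
```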

78 citations


Journal ArticleDOI
TL;DR: An adaptive sampling approach under the bias-variance decomposition framework that sequentially selects new points by maximizing an expected prediction error criterion that considers both the bias and variance information is proposed.

77 citations


Journal ArticleDOI
TL;DR: Every accuracy estimate resulting from the instances in a fold or a data set is considered as a point estimator instead of a fixed value, in order to derive the sampling distribution of the point estimator for comparing the performance of two classification algorithms.
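The TL;DR's idea, very roughly: each fold accuracy is itself a random quantity with a sampling distribution, not a fixed number. Below is a hedged sketch of one simple way to act on that view (a binomial standard error per fold plus a paired comparison across folds); the paper's actual derivation of the sampling distribution may differ, and the numbers are synthetic.

```python
import numpy as np
from scipy import stats

# per-fold accuracies of two classifiers on the same 10 folds (synthetic)
acc_a = np.array([0.81, 0.79, 0.83, 0.80, 0.78, 0.82, 0.84, 0.80, 0.79, 0.81])
acc_b = np.array([0.78, 0.77, 0.80, 0.79, 0.76, 0.79, 0.81, 0.78, 0.77, 0.78])
n_per_fold = 100   # instances per fold

# each fold accuracy is a binomial proportion estimate, so it carries its
# own standard error sqrt(p(1-p)/n) rather than being a fixed value
se_a = np.sqrt(acc_a * (1 - acc_a) / n_per_fold)
print("per-fold standard errors of classifier A:", se_a.round(3))

# paired comparison treating the fold-level differences as samples
t_stat, p_value = stats.ttest_rel(acc_a, acc_b)
print(f"paired t = {t_stat:.2f}, p = {p_value:.4f}")
```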

50 citations


Journal ArticleDOI
TL;DR: This study develops a variables MDS sampling plan for lot sentencing based on the advanced process capability index, which was developed by combining the merits of the yield-based index and loss-based index.
Abstract: Acceptance sampling plans have been utilised predominantly for the inspection of outgoing and incoming lots; these plans provide effective rules to vendors and buyers for making decisions on product acceptance or rejection. Multiple dependent state (MDS) sampling plans have been developed for lot sentencing and are shown to be more efficient than traditional single sampling plans. The decision criteria of MDS sampling plans are based on sample information not only from the current lot but also from preceding lots. In this study, we develop a variables MDS sampling plan for lot sentencing based on the advanced process capability index, which was developed by combining the merits of the yield-based index and loss-based index. The operating characteristic function of the developed plan is derived based on the exact sampling distribution. The determination of plan parameters is formulated as an optimisation model with non-linear constraints, where the objective is to minimise the sample size required for insp...
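The paper works with a variables plan and the exact sampling distribution of a capability-index estimator. As a simpler, hedged illustration of how multiple dependent state acceptance works, here is the operating characteristic of the classical attributes MDS-1(c1, c2, m) plan, in which a marginal lot is accepted only if the preceding m lots were accepted outright; this is a stand-in, not the developed variables plan.

```python
from scipy.stats import binom

def oc_mds(p, n, c1, c2, m):
    """Probability of accepting a lot with fraction nonconforming p under an
    attributes MDS-1(c1, c2, m) plan with sample size n: accept if d <= c1,
    reject if d > c2, and if c1 < d <= c2 accept only when each of the
    preceding m lots was accepted unconditionally (d <= c1)."""
    p_accept_outright = binom.cdf(c1, n, p)
    p_marginal = binom.cdf(c2, n, p) - p_accept_outright
    return p_accept_outright + p_marginal * p_accept_outright**m

for p in (0.01, 0.03, 0.05, 0.10):
    print(f"p = {p:.2f}: Pa = {oc_mds(p, n=50, c1=1, c2=3, m=2):.3f}")
```

Sweeping p traces out the OC curve; the plan parameters (n, c1, c2, m) are what the optimisation model in the paper chooses subject to the producer's and consumer's quality levels.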

42 citations


Posted Content
TL;DR: The theory and results show that inference based on GP regression tends to be conservative; when the prior is under-smoothed, the resulting credible intervals and bands have minimax-optimal sizes, with their frequentist coverage converging to a non-degenerate value between their nominal level and one.
Abstract: Gaussian process (GP) regression is a powerful interpolation technique due to its flexibility in capturing non-linearity. In this paper, we provide a general framework for understanding the frequentist coverage of point-wise and simultaneous Bayesian credible sets in GP regression. As an intermediate result, we develop a Bernstein-von Mises type result under supremum norm in random design GP regression. Identifying both the mean and covariance function of the posterior distribution of the Gaussian process as regularized $M$-estimators, we show that the sampling distribution of the posterior mean function and the centered posterior distribution can be respectively approximated by two population level GPs. By developing a comparison inequality between two GPs, we provide exact characterization of frequentist coverage probabilities of Bayesian point-wise credible intervals and simultaneous credible bands of the regression function. Our results show that inference based on GP regression tends to be conservative; when the prior is under-smoothed, the resulting credible intervals and bands have minimax-optimal sizes, with their frequentist coverage converging to a non-degenerate value between their nominal level and one. As a byproduct of our theory, we show that the GP regression also yields minimax-optimal posterior contraction rate relative to the supremum norm, which provides positive evidence on the long-standing problem of the optimal supremum norm contraction rate in GP regression.

40 citations


Proceedings Article
06 Aug 2017
TL;DR: This work employs a bandit optimization procedure that "learns" probabilities for sampling coordinates or examples in (non-smooth) optimization problems, allowing it to guarantee performance close to that of the optimal stationary sampling distribution.
Abstract: Standard forms of coordinate and stochastic gradient methods do not adapt to structure in data; their good behavior under random sampling is predicated on uniformity in data. When gradients in certain blocks of features (for coordinate descent) or examples (for SGD) are larger than others, there is a natural structure that can be exploited for quicker convergence. Yet adaptive variants often suffer nontrivial computational overhead. We present a framework that discovers and leverages such structural properties at a low computational cost. We employ a bandit optimization procedure that "learns" probabilities for sampling coordinates or examples in (non-smooth) optimization problems, allowing us to guarantee performance close to that of the optimal stationary sampling distribution. When such structures exist, our algorithms achieve tighter convergence guarantees than their non-adaptive counterparts, and we complement our analysis with experiments on several datasets.
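A rough sketch of the idea, not the authors' exact algorithm: keep multiplicative weights over examples, sample from that distribution mixed with uniform, take an importance-weighted SGD step so the gradient estimate stays unbiased, and feed the observed gradient magnitude back to the bandit as a reward. The learning rates, clipping, and reward definition below are ad hoc assumptions chosen only to make the sketch run stably.

```python
import numpy as np

rng = np.random.default_rng(3)

# synthetic least-squares problem: some examples have much larger gradients
n, d = 500, 10
X = rng.normal(size=(n, d))
X[:50] *= 10.0                      # a block of "important" examples
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

w = np.zeros(d)
bandit_w = np.ones(n)               # multiplicative weights over examples
gamma, eta, step = 0.2, 0.01, 1e-4

for it in range(20000):
    # sampling distribution: bandit distribution mixed with uniform
    p = (1 - gamma) * bandit_w / bandit_w.sum() + gamma / n
    i = rng.choice(n, p=p)

    g_i = (X[i] @ w - y[i]) * X[i]  # gradient of the i-th squared loss
    w -= step * g_i / (n * p[i])    # importance weight keeps SGD unbiased

    # bandit feedback: larger observed gradients -> sample more often
    reward = min(np.linalg.norm(g_i), 100.0)      # clipped for stability
    bandit_w[i] *= np.exp(eta * reward / (n * p[i]))

print("objective:", 0.5 * np.mean((X @ w - y) ** 2))
```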

39 citations


Proceedings Article
01 Dec 2017
TL;DR: It is shown that coordinate descent and stochastic gradient descent can enjoy a significant speed-up under the novel sampling scheme, which can be computed efficiently - in many applications at negligible extra cost.
Abstract: Importance sampling has become an indispensable strategy to speed up optimization algorithms for large-scale applications. Improved adaptive variants -- using importance values defined by the complete gradient information which changes during optimization -- enjoy favorable theoretical properties, but are typically computationally infeasible. In this paper we propose an efficient approximation of gradient-based sampling, which is based on safe bounds on the gradient. The proposed sampling distribution is (i) provably the best sampling with respect to the given bounds, (ii) always better than uniform sampling and fixed importance sampling and (iii) can efficiently be computed -- in many applications at negligible extra cost. The proposed sampling scheme is generic and can easily be integrated into existing algorithms. In particular, we show that coordinate-descent (CD) and stochastic gradient descent (SGD) can enjoy a significant speed-up under the novel scheme. The proven efficiency of the proposed sampling is verified by extensive numerical testing.
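A hedged sketch of the general idea with fixed bounds rather than the paper's tighter adaptive ones: for a linear model each per-example gradient norm is bounded by a quantity proportional to the example's feature norm, so sampling proportionally to those bounds and re-weighting keeps SGD unbiased while focusing effort on examples that can contribute large gradients.

```python
import numpy as np

rng = np.random.default_rng(4)

# synthetic regression data with heterogeneous example norms
n, d = 1000, 20
X = rng.normal(size=(n, d)) * rng.lognormal(sigma=1.0, size=(n, 1))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# safe upper bounds on per-example gradient norms for squared loss:
# ||(x_i . w - y_i) x_i|| <= B_i with B_i proportional to ||x_i||
# (a crude fixed bound; the paper derives tighter, adaptive ones)
bounds = np.linalg.norm(X, axis=1)
p = bounds / bounds.sum()

w = np.zeros(d)
step = 1e-5
idx = rng.choice(n, size=50_000, p=p)   # importance-sampled example indices
for i in idx:
    g_i = (X[i] @ w - y[i]) * X[i]
    w -= step * g_i / (n * p[i])        # re-weighting keeps the step unbiased

print("mean squared error:", np.mean((X @ w - y) ** 2))
```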

39 citations


Journal ArticleDOI
TL;DR: In this paper, the authors revisited the Huber estimator from a new perspective by letting the tuning parameter involved diverge with the sample size, and obtained a sub-Gaussian-type deviation inequality and a nonasymptotic Bahadur representation when noise variables only have finite second moments.
Abstract: Heavy-tailed errors impair the accuracy of the least squares estimate, which can be spoiled by a single grossly outlying observation. As argued in the seminal work of Peter Huber in 1973 [Ann. Statist. 1 (1973) 799-821], robust alternatives to the method of least squares are sorely needed. To achieve robustness against heavy-tailed sampling distributions, we revisit the Huber estimator from a new perspective by letting the tuning parameter involved diverge with the sample size. In this paper, we develop nonasymptotic concentration results for such an adaptive Huber estimator, namely, the Huber estimator with the tuning parameter adapted to sample size, dimension, and the variance of the noise. Specifically, we obtain a sub-Gaussian-type deviation inequality and a nonasymptotic Bahadur representation when noise variables only have finite second moments. The nonasymptotic results further yield two conventional normal approximation results that are of independent interest, the Berry-Esseen inequality and Cramér-type moderate deviation. As an important application to large-scale simultaneous inference, we apply these robust normal approximation results to analyze a dependence-adjusted multiple testing procedure for moderately heavy-tailed data. It is shown that the robust dependence-adjusted procedure asymptotically controls the overall false discovery proportion at the nominal level under mild moment conditions. Thorough numerical results on both simulated and real datasets are also provided to back up our theory.
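A hedged one-dimensional illustration of letting the robustification parameter grow with the sample size: estimate a mean under heavy-tailed noise with a Huber loss whose truncation level is set to roughly sigma * sqrt(n / log n). The constant `c` and the crude variance plug-in are ad hoc choices for this sketch; the paper gives the principled, dimension-aware versions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(5)

def huber_loss(r, tau):
    """Huber loss: quadratic for |r| <= tau, linear beyond."""
    a = np.abs(r)
    return np.where(a <= tau, 0.5 * r**2, tau * a - 0.5 * tau**2)

def adaptive_huber_mean(x, c=1.0):
    n = len(x)
    sigma_hat = np.std(x)                          # crude variance plug-in
    tau = c * sigma_hat * np.sqrt(n / np.log(n))   # diverging tuning parameter
    res = minimize_scalar(lambda mu: huber_loss(x - mu, tau).sum())
    return res.x, tau

# heavy-tailed sample: t distribution with 2.5 degrees of freedom
# (finite second moment, matching the paper's condition), true mean 1
x = 1.0 + rng.standard_t(df=2.5, size=2000)
mu_hat, tau = adaptive_huber_mean(x)
print(f"sample mean {x.mean():.3f}, adaptive Huber mean {mu_hat:.3f}, tau {tau:.1f}")
```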

Journal ArticleDOI
TL;DR: This paper derives explicit data-driven bounds on the Wasserstein distance between the posterior distribution based on a given prior and the no-prior posterior based uniquely on the sampling distribution, confirming the well-known fact that with well-identified parameters and large sample sizes, reasonable choices of prior distributions will have only minor effects on posterior inferences if the data are benign.
Abstract: In this paper, we propose tight upper and lower bounds for the Wasserstein distance between any two univariate continuous distributions with probability densities $p_{1}$ and $p_{2}$ having nested supports. These explicit bounds are expressed in terms of the derivative of the likelihood ratio $p_{1}/p_{2}$ as well as the Stein kernel $\tau_{1}$ of $p_{1}$. The method of proof relies on a new variant of Stein’s method which manipulates Stein operators. We give several applications of these bounds. Our main application is in Bayesian statistics: we derive explicit data-driven bounds on the Wasserstein distance between the posterior distribution based on a given prior and the no-prior posterior based uniquely on the sampling distribution. This is the first finite sample result confirming the well-known fact that with well-identified parameters and large sample sizes, reasonable choices of prior distributions will have only minor effects on posterior inferences if the data are benign.
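A hedged numerical companion to the Bayesian application: in a conjugate normal-mean model, draw samples from the posterior under an informative prior and from the posterior under a flat prior (the "no-prior" posterior driven by the sampling distribution alone), and compute the empirical 1-Wasserstein distance between them. The paper's contribution is explicit analytic bounds on this quantity, which the sketch does not reproduce.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(6)

# data: n observations from N(theta, sigma^2) with sigma known
sigma, n, theta_true = 1.0, 50, 0.3
x = rng.normal(theta_true, sigma, size=n)

# posterior under a N(mu0, tau0^2) prior (conjugate update)
mu0, tau0 = 2.0, 0.5
post_var = 1.0 / (1.0 / tau0**2 + n / sigma**2)
post_mean = post_var * (mu0 / tau0**2 + x.sum() / sigma**2)

# "no-prior" posterior: flat prior, i.e. N(xbar, sigma^2 / n)
flat_mean, flat_var = x.mean(), sigma**2 / n

a = rng.normal(post_mean, np.sqrt(post_var), size=100_000)
b = rng.normal(flat_mean, np.sqrt(flat_var), size=100_000)
print("empirical W1 distance:", wasserstein_distance(a, b))
```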

Proceedings Article
02 Dec 2017
TL;DR: This article proposed a new class of kernel-based sampling methods and developed an efficient sampling algorithm that adapts to the model as it is trained, resulting in low bias, and empirically studied the trade-off between bias, sampling distribution, and sample size.
Abstract: Softmax is the most commonly used output function for multiclass problems and is widely used in areas such as vision, natural language processing, and recommendation. A softmax model has linear costs in the number of classes which makes it too expensive for many real-world problems. A common approach to speed up training involves sampling only some of the classes at each training step. It is known that this method is biased and that the bias increases the more the sampling distribution deviates from the output distribution. Nevertheless, almost all recent work uses simple sampling distributions that require a large sample size to mitigate the bias. In this work, we propose a new class of kernel-based sampling methods and develop an efficient sampling algorithm. Kernel-based sampling adapts to the model as it is trained, thus resulting in low bias. Kernel-based sampling can be easily applied to many models because it relies only on the model's last hidden layer. We empirically study the trade-off of bias, sampling distribution and sample size and show that kernel-based sampling results in low bias with few samples.
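A minimal numpy sketch of sampled softmax with the usual logit correction (subtracting the log of the sampling probability), which is the mechanism whose bias the abstract discusses. The kernel-based sampler itself is replaced here by an arbitrary proposal distribution `q`, so this only illustrates the sampled loss, not the paper's adaptive sampler; all sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)

def sampled_softmax_loss(hidden, W, true_class, q, n_samples=20):
    """Cross-entropy over the true class plus negatives drawn from q,
    with logits corrected by -log(q) to reduce the bias of sampling."""
    num_classes = W.shape[0]
    neg = rng.choice(num_classes, size=n_samples, replace=False, p=q)
    neg = neg[neg != true_class]
    classes = np.concatenate(([true_class], neg))

    logits = W[classes] @ hidden - np.log(q[classes])    # corrected logits
    logits -= logits.max()                               # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[0]                                 # true class is index 0

# toy setup: 10,000 classes, 64-dimensional last hidden layer
num_classes, dim = 10_000, 64
W = rng.normal(scale=0.1, size=(num_classes, dim))
hidden = rng.normal(size=dim)
q = np.full(num_classes, 1.0 / num_classes)              # uniform proposal
print("sampled softmax loss:", sampled_softmax_loss(hidden, W, 3, q))
```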

Posted Content
TL;DR: In this paper, the loss-likelihood bootstrap is used to calibrate the general Bayesian posterior by matching asymptotic Fisher information; general Bayesian updating is a way of updating prior belief distributions without needing to construct a global probability model, yet requires the calibration of two forms of loss function.
Abstract: In this paper we revisit the weighted likelihood bootstrap, a method that generates samples from an approximate Bayesian posterior of a parametric model. We show that the same method can be derived, without approximation, under a Bayesian nonparametric model with the parameter of interest defined as minimising an expected negative log-likelihood under an unknown sampling distribution. This interpretation enables us to extend the weighted likelihood bootstrap to posterior sampling for parameters minimizing an expected loss. We call this method the loss-likelihood bootstrap. We make a connection between this and general Bayesian updating, which is a way of updating prior belief distributions without needing to construct a global probability model, yet requires the calibration of two forms of loss function. The loss-likelihood bootstrap is used to calibrate the general Bayesian posterior by matching asymptotic Fisher information. We demonstrate the methodology on a number of examples.
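A hedged sketch of the mechanics described in the abstract: draw random weights, minimise the weighted loss, and treat the collection of minimisers as posterior samples. Here the loss is a Gaussian negative log-likelihood for a mean, so this is the weighted likelihood bootstrap special case; swapping in any other loss gives the loss-likelihood bootstrap, and the Fisher-information calibration step is omitted.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(8)

x = rng.normal(loc=1.5, scale=2.0, size=100)

def weighted_loss_minimizer(x, weights, loss):
    """Minimize the weighted sum of per-observation losses."""
    res = minimize_scalar(lambda theta: np.sum(weights * loss(x, theta)))
    return res.x

def gaussian_nll(x, theta):            # negative log-likelihood up to constants
    return 0.5 * (x - theta) ** 2

posterior = []
for _ in range(2000):
    w = rng.exponential(size=len(x))   # i.i.d. Exp(1) weights
    posterior.append(weighted_loss_minimizer(x, w, gaussian_nll))
posterior = np.array(posterior)

print("posterior mean:", posterior.mean(), " posterior sd:", posterior.std())
```

For this quadratic loss the minimiser is just the weighted mean; the generic optimiser is used only to emphasise that any loss could be plugged in.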

Journal ArticleDOI
14 Jul 2017-PLOS ONE
TL;DR: Based on the PTE and the seven resampling methods, it is consistently found that changes in crude oil cause inflation conditioning on money supply in the post-1986 period; however, this relationship cannot be explained on the basis of traditional cost-push mechanisms.
Abstract: Different resampling methods for the null hypothesis of no Granger causality are assessed in the setting of multivariate time series, taking into account that the driving-response coupling is conditioned on the other observed variables. As appropriate test statistic for this setting, the partial transfer entropy (PTE), an information and model-free measure, is used. Two resampling techniques, time-shifted surrogates and the stationary bootstrap, are combined with three independence settings (giving a total of six resampling methods), all approximating the null hypothesis of no Granger causality. In these three settings, the level of dependence is changed, while the conditioning variables remain intact. The empirical null distribution of the PTE, as the surrogate and bootstrapped time series become more independent, is examined along with the size and power of the respective tests. Additionally, we consider a seventh resampling method by contemporaneously resampling the driving and the response time series using the stationary bootstrap. Although this case does not comply with the no causality hypothesis, one can obtain an accurate sampling distribution for the mean of the test statistic since its value is zero under H0. Results indicate that as the resampling setting gets more independent, the test becomes more conservative. Finally, we conclude with a real application. More specifically, we investigate the causal links among the growth rates for the US CPI, money supply and crude oil. Based on the PTE and the seven resampling methods, we consistently find that changes in crude oil cause inflation conditioning on money supply in the post-1986 period. However this relationship cannot be explained on the basis of traditional cost-push mechanisms.
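A hedged miniature of the surrogate-testing logic, with a simple lagged cross-correlation standing in for the partial transfer entropy (so it illustrates only the time-shifted surrogate scheme, not the PTE or the conditioning on other variables): destroy the driver-response coupling by circularly shifting the driver, recompute the statistic, and compare.

```python
import numpy as np

rng = np.random.default_rng(14)

# coupled pair: x drives y with a one-step lag
T = 2000
x = rng.normal(size=T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.4 * y[t - 1] + 0.5 * x[t - 1] + rng.normal()

def coupling_stat(x, y):
    """Lagged cross-correlation |corr(x_{t-1}, y_t)| -- a stand-in for PTE."""
    return abs(np.corrcoef(x[:-1], y[1:])[0, 1])

stat_obs = coupling_stat(x, y)

# time-shifted surrogates: circular shifts of the driver destroy the
# coupling while preserving the marginal structure of each series
n_surr = 1000
shifts = rng.integers(20, T - 20, size=n_surr)
null_stats = np.array([coupling_stat(np.roll(x, s), y) for s in shifts])

p_value = (1 + np.sum(null_stats >= stat_obs)) / (1 + n_surr)
print(f"observed stat = {stat_obs:.3f}, surrogate p-value = {p_value:.4f}")
```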

Journal ArticleDOI
TL;DR: A Bayesian nonparametric procedure that leads to a tractable, explicit and analytic quantification of the relative evidence for dependence vs independence and uses Polya tree priors on the space of probability measures, which can then be embedded within a decision theoretic test for dependence.
Abstract: Nonparametric and nonlinear measures of statistical dependence between pairs of random variables are important tools in modern data analysis. In particular the emergence of large data sets can now support the relaxation of linearity assumptions implicit in traditional association scores such as correlation. Here we describe a Bayesian nonparametric procedure that leads to a tractable, explicit and analytic quantification of the relative evidence for dependence vs independence. Our approach uses Polya tree priors on the space of probability measures which can then be embedded within a decision theoretic test for dependence. Polya tree priors can accommodate known uncertainty in the form of the underlying sampling distribution and provide an explicit posterior probability measure of both dependence and independence. Well-known advantages of having an explicit probability measure include: easy comparison of evidence across different studies; encoding prior information; quantifying changes in dependence across different experimental conditions, and the integration of results within formal decision analysis.

Posted Content
TL;DR: In this paper, the authors proposed an efficient approximation of gradient-based sampling, which is based on safe bounds on the gradient, and showed that coordinate-descent (CD) and stochastic gradient descent (SGD) can enjoy a significant speed-up under the proposed scheme.
Abstract: Importance sampling has become an indispensable strategy to speed up optimization algorithms for large-scale applications. Improved adaptive variants - using importance values defined by the complete gradient information which changes during optimization - enjoy favorable theoretical properties, but are typically computationally infeasible. In this paper we propose an efficient approximation of gradient-based sampling, which is based on safe bounds on the gradient. The proposed sampling distribution is (i) provably the best sampling with respect to the given bounds, (ii) always better than uniform sampling and fixed importance sampling and (iii) can efficiently be computed - in many applications at negligible extra cost. The proposed sampling scheme is generic and can easily be integrated into existing algorithms. In particular, we show that coordinate-descent (CD) and stochastic gradient descent (SGD) can enjoy a significant speed-up under the novel scheme. The proven efficiency of the proposed sampling is verified by extensive numerical testing.

Journal ArticleDOI
TL;DR: The operating characteristics of the Huggins logit-normal estimator are evaluated through computer simulations, using Gaussian–Hermite quadrature to model individual encounter heterogeneity; confidence interval coverage of N appears close to the nominal 95% when the estimator is not biased.
Abstract: Estimation of population abundance is a common problem in wildlife ecology and management. Capture-mark-reencounter (CMR) methods using marked animals are a standard approach, particularly in recent history with the development of innovative methods of marking using camera traps or DNA samples. However, estimates of abundance from multiple encounters of marked individuals are biased low when individual heterogeneity of encounter probabilities is not accounted for in the estimator. We evaluated the operating characteristics of the Huggins logit-normal estimator through computer simulations, using Gaussian–Hermite quadrature to model individual encounter heterogeneity. We simulated individual encounter data following a factorial design with 2 levels of sampling occasions (t = 5, 10), 3 levels of abundance (N = 100, 500, 1,000), 4 levels of median detection probabilities (p = 0.1, 0.2, 0.4, 0.6) for each sampling occasion (on the probability scale), and 4 levels of individual heterogeneity (σp = 0, 0.5, 1, 2; on the logit normal scale), resulting in a design space consisting of 96 simulation scenarios (2 × 3 × 4 × 4). For each scenario, we performed 1,000 simulations using the Huggins estimators Mt, M0, MtRE, and M0RE, where the RE subscript corresponds to the random effects model. As expected, the Mt and M0 estimators were biased when individual heterogeneity was present but unbiased for σp = 0 data. The estimators for MtRE and M0RE were biased high for N = 100 and median p ≤ 0.2 but showed little bias elsewhere. The bias is attributed to the occasional sets of data that result in a low overall detection probability and a resulting highly skewed sampling distribution of Nˆ. This result is confirmed in that the median of the sampling distributions was only slightly biased high. The random effects estimators performed poorly for σp = 0 data, mainly because a log link function forces the estimate of σp > 0. However, the Fletcher cˆ statistic provided useful evidence to evaluate σp > 0, as did likelihood ratio tests of the null hypothesis σp = 0. Generally, confidence interval coverage of N appears close to the nominal 95% expected when the estimator is not biased. © 2017 The Wildlife Society.

Journal ArticleDOI
TL;DR: In this article, a variables multiple dependent state sampling plan (VMDSSP) for a unilateral specification limit based on a one-sided capability index is proposed. The plan parameters are determined by minimizing the average sample number while satisfying the quality levels demanded by both the producer and the consumer.
Abstract: Acceptance sampling plan has been considered as one of most practical tools for quality assurance applications. While various types of acceptance sampling plans have been developed for different purposes, single acceptance sampling plan is the most popular because it is simple to administrate. However, a new concept called multiple dependent state sampling has gained the attention of scholars in recent years. The underlying principle is that the acceptance of a submitted lot should not only depend on the quality of the current lot but also consider the quality of the preceding lots. This research develops a variables multiple dependent state sampling plan (VMDSSP) for unilateral specification limit based on a one-sided capability index. The operating characteristic (OC) curve is prepared based on the exact sampling distribution. The plan parameters are determined by minimizing the average sample number while satisfying the quality levels demanded by both the producer and the consumer. The performa...

Journal ArticleDOI
TL;DR: The assumption of normality, an assumption essential to the meaningful interpretation of a t test, is explored; one way to explore it is to bootstrap the sample mean, the difference between sample means, or t itself.
Abstract: Learning about statistics is a lot like learning about science: the learning is more meaningful if you can actively explore. This twelfth installment of Explorations in Statistics explores the assu...
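In the spirit of that exploration, a small sketch that bootstraps t for a single sample and compares the resampled distribution with the theoretical t distribution; the data are synthetic and deliberately non-normal.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)

x = rng.exponential(scale=2.0, size=30)     # a decidedly non-normal sample
n = len(x)
t_obs = (x.mean() - 2.0) / (x.std(ddof=1) / np.sqrt(n))

# bootstrap the t statistic, centering at the observed mean
boot_t = np.empty(10_000)
for b in range(boot_t.size):
    xb = rng.choice(x, size=n, replace=True)
    boot_t[b] = (xb.mean() - x.mean()) / (xb.std(ddof=1) / np.sqrt(n))

# compare tail quantiles with the t(n-1) reference distribution
print("bootstrap 2.5%/97.5%:", np.percentile(boot_t, [2.5, 97.5]))
print("t(n-1)    2.5%/97.5%:", stats.t.ppf([0.025, 0.975], df=n - 1))
```

If the normality assumption were harmless here, the two pairs of quantiles would be close; the skewed exponential sample makes the discrepancy visible.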

Journal ArticleDOI
TL;DR: In this paper, a variable sampling plan is developed for resubmitted lots based on a process capability index and a Bayesian approach, and an optimization model is presented for determining the decision parameters of the developed sampling plan subject to constraints on the consumer's and producer's risks.
Abstract: Acceptance sampling plans are applied for quality inspection of products. Among the design approaches for sampling plans, the most important one is to use process capability indices in order to improve the quality of manufacturing processes and the quality inspection of products. The selection of estimators of the process capability index and their sampling distribution is therefore very important, and Bayesian statistical techniques can be used to obtain the sampling distribution. In this paper, a variable sampling plan is developed for resubmitted lots based on a process capability index and a Bayesian approach. In the proposed sampling plan, lots are inspected several times depending on the quality level of the process. In addition, this paper presents an optimization model for determining the decision parameters of the developed sampling plan with regard to the constraints related to the consumer's and producer's risks. Two comparison studies have been carried out. First, the methods of the double sampling plan (DSP), the multiple dependent state (MDS) sampling plan, and the repetitive group sampling (RGS) plan are elaborated, and, in order to compare the developed sampling plans, the expected number of inspected products, i.e. the average sample number (ASN), is used for the different developed plans. Second, a comparison between the Bayesian approach and the exact probability distribution is carried out and the results are analyzed. It is observed that the ASN values of the MDS sampling plan are less than the ASN values of the other methods, and also that the ASN values of the different variable sampling plans based on the Bayesian approach are less than the ASN values obtained using the exact approach.

Posted Content
TL;DR: In this article, the authors formulate the fully sequential sampling and selection decision in statistical ranking and selection as a stochastic control problem, and derive the associated Bellman equation using value function approximation.
Abstract: Under a Bayesian framework, we formulate the fully sequential sampling and selection decision in statistical ranking and selection as a stochastic control problem, and derive the associated Bellman equation. Using value function approximation, we derive an approximately optimal allocation policy. We show that this policy is not only computationally efficient but also possesses both one-step-ahead and asymptotic optimality for independent normal sampling distributions. Moreover, the proposed allocation policy is easily generalizable in the approximate dynamic programming paradigm.

Journal ArticleDOI
TL;DR: The results of 2 survey experiments support the existence of a cliff effect at p = .05 and suggest that researchers tend to be more likely to recommend submission of an article as the level of statistical significance increases beyond this p level.
Abstract: p-curves provide a useful window for peeking into the file drawer in a way that might reveal p-hacking (Simonsohn, Nelson, & Simmons, 2014a). The properties of p-curves are commonly investigated by computer simulations. On the basis of these simulations, it has been proposed that the skewness of this curve can be used as a diagnostic tool to decide whether the significant p values within a certain domain of research suggest the presence of p-hacking or actually demonstrate that there is a true effect. Here we introduce a rigorous mathematical approach that allows the properties of p-curves to be examined without simulations. This approach allows the computation of a p-curve for any statistic whose sampling distribution is known and thereby allows a thorough evaluation of its properties. For example, it shows under which conditions p-curves would exhibit the shape of a monotone decreasing function. In addition, we used weighted distribution functions to analyze how 2 different types of publication bias (i.e., cliff effects and gradual publication bias) influence the shapes of p-curves. The results of 2 survey experiments with more than 1,000 participants support the existence of a cliff effect at p = .05 and also suggest that researchers tend to be more likely to recommend submission of an article as the level of statistical significance increases beyond this p level. This gradual bias produces right-skewed p-curves mimicking the existence of real effects even when no such effects are actually present.
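A hedged simulation companion to the analytic approach: for a z-test with a given true effect, draw many studies, keep the significant p values, and look at how they distribute across bins. A true effect piles p values up near zero (a right-skewed p-curve), while a null effect gives a flat one; the effect sizes and bin choices below are arbitrary.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(10)

def p_curve(delta, n_studies=100_000):
    """Two-sided p values of z-tests when the true standardized effect of
    the test statistic is `delta`; returns only the significant ones."""
    z = rng.normal(loc=delta, scale=1.0, size=n_studies)
    p = 2 * norm.sf(np.abs(z))
    return p[p < 0.05]

bins = [0, 0.01, 0.02, 0.03, 0.04, 0.05]
for delta in (0.0, 2.0):
    sig = p_curve(delta)
    share, _ = np.histogram(sig, bins=bins)
    print(f"delta={delta}: bin shares", np.round(share / share.sum(), 2))
```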

Journal ArticleDOI
01 Jul 2017-Genetics
TL;DR: It is shown how using confidence intervals from sampling distributions of genetic eigenvalues without reference to the Tracy–Widom distribution is insufficient protection against mistaking sampling error as genetic variance, particularly when eigenvalues are small.
Abstract: The distribution of genetic variance in multivariate phenotypes is characterized by the empirical spectral distribution of the eigenvalues of the genetic covariance matrix. Empirical estimates of genetic eigenvalues from random effects linear models are known to be overdispersed by sampling error, where large eigenvalues are biased upward, and small eigenvalues are biased downward. The overdispersion of the leading eigenvalues of sample covariance matrices have been demonstrated to conform to the Tracy-Widom (TW) distribution. Here we show that genetic eigenvalues estimated using restricted maximum likelihood (REML) in a multivariate random effects model with an unconstrained genetic covariance structure will also conform to the TW distribution after empirical scaling and centering. However, where estimation procedures using either REML or MCMC impose boundary constraints, the resulting genetic eigenvalues tend not be TW distributed. We show how using confidence intervals from sampling distributions of genetic eigenvalues without reference to the TW distribution is insufficient protection against mistaking sampling error as genetic variance, particularly when eigenvalues are small. By scaling such sampling distributions to the appropriate TW distribution, the critical value of the TW statistic can be used to determine if the magnitude of a genetic eigenvalue exceeds the sampling error for each eigenvalue in the spectral distribution of a given genetic covariance matrix.

Book
06 Apr 2017
TL;DR: This book discusses statistical methods and empirical research, which involve linking positive theories and data-generating processes, and their applications in science, medicine and public policy.
Abstract: 1. Introduction 2. Descriptive statistics: data and information 3. Observable data and data-generating processes 4. Probability theory: basic properties of data-generating processes 5. Expectation and moments: summaries of data-generating processes 6. Probability and models: linking positive theories and data-generating processes 7. Sampling distributions: linking data-generating processes and observable data 8. Hypothesis testing: assessing claims about the data-generating process 9. Estimation: recovering properties of the data-generating process 10. Causal inference: inferring causation from correlation Afterword: statistical methods and empirical research.

Journal ArticleDOI
TL;DR: This paper proposes an approach aimed at systematic accuracy estimation of quantities provided by end-user devices of a crowd-based sensing system, obtained through the combination of statistical bootstrap with uncertainty propagation techniques and leading to a consistent and technically sound methodology.
Abstract: The diffusion of mobile devices equipped with sensing, computation, and communication capabilities is opening unprecedented possibilities for high-resolution, spatio-temporal mapping of several phenomena. This novel data generation, collection, and processing paradigm, termed crowdsensing, rests upon complex, distributed cyberphysical systems. Collective data gathering from heterogeneous, spatially distributed devices inherently raises the question of how to manage different quality levels of contributed data. In order to extract meaningful information, it is therefore desirable to introduce effective methods for evaluating the quality of data. In this paper, we propose an approach aimed at systematic accuracy estimation of quantities provided by end-user devices of a crowd-based sensing system. This is obtained thanks to the combination of statistical bootstrap with uncertainty propagation techniques, leading to a consistent and technically sound methodology. Uncertainty propagation provides a formal framework for combining uncertainties, resulting from different quantities influencing a given measurement activity. Statistical bootstrap enables the characterization of the sampling distribution of a given statistic without any prior assumption on the type of statistical distributions behind the data generation process. The proposed approach is evaluated on synthetic benchmarks and on a real world case study. Cross-validation experiments show that confidence intervals computed by means of the presented technique show a maximum 1.5% variation with respect to interval widths computed by means of controlled standard Monte Carlo methods, under a wide range of operating conditions. In general, experimental results confirm the suitability and validity of the introduced methodology.
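A hedged toy version of the combination described in the abstract: each contributed reading carries its own standard uncertainty, which is propagated by perturbing the readings, and the statistical bootstrap then characterizes the sampling distribution of the aggregate statistic without distributional assumptions. The variable names, the aggregation (a simple mean), and the uncertainty model are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(15)

# crowdsensed readings from heterogeneous devices, each with its own
# standard uncertainty (synthetic numbers, e.g. temperature in Celsius)
readings = rng.normal(25.0, 1.0, size=200)
u_device = rng.uniform(0.2, 1.5, size=200)     # per-device uncertainty

def bootstrap_with_propagation(readings, u, n_boot=5000):
    """Resample devices (bootstrap) and perturb each resampled reading by
    its own uncertainty (propagation), then recompute the aggregate."""
    n = len(readings)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(n, size=n)
        perturbed = readings[idx] + rng.normal(0.0, u[idx])
        stats[b] = perturbed.mean()
    return stats

dist = bootstrap_with_propagation(readings, u_device)
print("estimate:", readings.mean())
print("95% confidence interval:", np.percentile(dist, [2.5, 97.5]))
```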

Journal ArticleDOI
TL;DR: The authors used randomization inference with historical weather patterns from 73 years as potential randomizations to estimate the variance of the effect of rainfall on voter turnout in presidential elections in the United States, and compared the estimated average treatment effect to a sampling distribution of estimates under the sharp null hypothesis of no effect.
Abstract: Many recent papers in political science and economics use rainfall as a strategy to facilitate causal inference. Rainfall shocks are as-if randomly assigned, but the assignment of rainfall by county is highly correlated across space. Since clustered assignment does not occur within well-defined boundaries, it is challenging to estimate the variance of the effect of rainfall on political outcomes. I propose using randomization inference with historical weather patterns from 73 years as potential randomizations. I replicate the influential work on rainfall and voter turnout in presidential elections in the United States by Gomez, Hansford, and Krause (2007) and compare the estimated average treatment effect (ATE) to a sampling distribution of estimates under the sharp null hypothesis of no effect. The alternate randomizations are random draws from national rainfall patterns on election and would-be election days, which preserve the clustering in treatment assignment and eliminate the need to simulate weather patterns or make assumptions about unit boundaries for clustering. I find that the effect of rainfall on turnout is subject to greater sampling variability than previously estimated using conventional standard errors.
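A hedged sketch of the inferential recipe described here, with synthetic data standing in for the county-level turnout and rainfall panels: compute the estimate under the observed assignment, recompute it under each alternative (historical) assignment while holding outcomes fixed, as the sharp null of no effect implies, and read off a p-value from that reference distribution.

```python
import numpy as np

rng = np.random.default_rng(11)

n_counties, n_alternatives = 3000, 73

# synthetic stand-ins: turnout and a rainfall indicator by county, plus a
# matrix of alternative assignments (here just independent draws; in the
# application they come from historical weather maps, which preserves the
# spatial clustering of treatment)
turnout = rng.normal(60, 8, size=n_counties)
rain_obs = rng.binomial(1, 0.3, size=n_counties)
rain_alt = rng.binomial(1, 0.3, size=(n_alternatives, n_counties))

def ate(outcome, treat):
    return outcome[treat == 1].mean() - outcome[treat == 0].mean()

ate_obs = ate(turnout, rain_obs)

# under the sharp null of no effect, outcomes are unchanged by treatment,
# so the ATE can be re-estimated under every alternative assignment
null_dist = np.array([ate(turnout, a) for a in rain_alt])
p_value = np.mean(np.abs(null_dist) >= np.abs(ate_obs))

print(f"ATE = {ate_obs:.2f}, randomization p-value = {p_value:.3f}")
```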

Journal ArticleDOI
TL;DR: In this article, a new limit theory is developed for co-moving systems with explosive processes, connecting continuous and discrete time formulations, using double asymptotics with infill (as the sampling interval tends to zero).

Posted Content
TL;DR: Two new bootstraps for exchangeable random graphs are introduced that accurately approximate the sampling distributions of motif densities, i.e., of the normalized counts of the number of times fixed subgraphs appear in the network.
Abstract: We introduce two new bootstraps for exchangeable random graphs. One, the "empirical graphon", is based purely on resampling, while the other, the "histogram stochastic block model", is a model-based "sieve" bootstrap. We show that both of them accurately approximate the sampling distributions of motif densities, i.e., of the normalized counts of the number of times fixed subgraphs appear in the network. These densities characterize the distribution of (infinite) exchangeable networks. Our bootstraps therefore give, for the first time, a valid quantification of uncertainty in inferences about fundamental network statistics, and so of parameters identifiable from them.
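A minimal sketch of the "empirical graphon" resampling idea as described in the abstract: resample vertices with replacement, take the induced adjacency matrix from the observed graph, and recompute a motif density such as the triangle density. The histogram stochastic block model variant is not shown, and the observed network here is just an Erdos-Renyi stand-in.

```python
import numpy as np

rng = np.random.default_rng(12)

def triangle_density(A):
    """Normalized triangle count of a simple graph with adjacency matrix A."""
    n = A.shape[0]
    triangles = np.trace(A @ A @ A) / 6
    return triangles / (n * (n - 1) * (n - 2) / 6)

def empirical_graphon_bootstrap(A, n_boot=300):
    """Resample vertices with replacement and recompute the motif density
    on the induced adjacency matrix (self-loops removed)."""
    n = A.shape[0]
    out = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(n, size=n)
        Ab = A[np.ix_(idx, idx)].copy()
        np.fill_diagonal(Ab, 0)
        out[b] = triangle_density(Ab)
    return out

# observed network: an Erdos-Renyi draw as a stand-in for real data
n, p = 200, 0.1
A = rng.binomial(1, p, size=(n, n))
A = np.triu(A, 1)
A = A + A.T

boots = empirical_graphon_bootstrap(A)
print("observed triangle density:", triangle_density(A))
print("bootstrap 95% interval:", np.percentile(boots, [2.5, 97.5]))
```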

Journal ArticleDOI
TL;DR: The use of parametric bootstrap methods is proposed to investigate the finite sample distribution of the maximum likelihood estimator for the parameter vector of a stochastic mortality model, and the LRT is applied to the cohort effects estimated from observed mortality rates for females in England and Wales and males in Scotland.
Abstract: We propose the use of parametric bootstrap methods to investigate the finite sample distribution of the maximum likelihood estimator for the parameter vector of a stochastic mortality model. Particular emphasis is placed on the effect that the size of the underlying population has on the distribution of the MLE in finite samples, and on the dependency structure of the resulting estimator: that is, the dependencies between estimators for the age, period and cohort effects in our model. In addition, we study the distribution of a likelihood ratio test statistic where we test a null hypothesis about the true parameters in our model. Finally, we apply the LRT to the cohort effects estimated from observed mortality rates for females in England and Wales and males in Scotland.
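Not the stochastic mortality model itself, but a hedged minimal version of the parametric bootstrap loop the abstract describes, using Poisson death counts with a single rate per age group so the MLE and the resampling step stay transparent; the effect of population size on the sampling distribution enters directly through the exposures.

```python
import numpy as np

rng = np.random.default_rng(13)

# "true" model: deaths_a ~ Poisson(exposure_a * rate_a) for a few age groups
exposure = np.array([50_000, 40_000, 30_000, 20_000], dtype=float)
true_rate = np.array([0.001, 0.003, 0.010, 0.030])
deaths = rng.poisson(exposure * true_rate)

# MLE of the rates in this simple model is deaths / exposure
rate_mle = deaths / exposure

# parametric bootstrap: simulate from the fitted model, refit, repeat
n_boot = 5000
boot = np.empty((n_boot, len(exposure)))
for b in range(n_boot):
    deaths_b = rng.poisson(exposure * rate_mle)
    boot[b] = deaths_b / exposure

print("MLE:", rate_mle)
print("bootstrap std errors:", boot.std(axis=0))
print("correlation of estimators:\n", np.corrcoef(boot, rowvar=False).round(2))
```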