
Showing papers on "Nonparametric statistics published in 2016"


Journal ArticleDOI
TL;DR: It is found that the most common software packages for fMRI analysis (SPM, FSL, AFNI) can result in false-positive rates of up to 70%.
Abstract: The most widely used task functional magnetic resonance imaging (fMRI) analyses use parametric statistical methods that depend on a variety of assumptions. In this work, we use real resting-state data and a total of 3 million random task group analyses to compute empirical familywise error rates for the fMRI software packages SPM, FSL, and AFNI, as well as a nonparametric permutation method. For a nominal familywise error rate of 5%, the parametric statistical methods are shown to be conservative for voxelwise inference and invalid for clusterwise inference. Our results suggest that the principal cause of the invalid cluster inferences is spatial autocorrelation functions that do not follow the assumed Gaussian shape. By comparison, the nonparametric permutation test is found to produce nominal results for voxelwise as well as clusterwise inference. These findings speak to the need of validating the statistical methods being used in the field of neuroimaging.
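For readers who want to see the nonparametric alternative in its simplest form, here is a minimal sketch (not the SPM/FSL/AFNI pipelines or any permutation toolbox) of a one-sample sign-flipping permutation test on hypothetical per-subject contrast values; the data, effect size, and function name are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
contrasts = rng.normal(0.2, 1.0, size=20)   # hypothetical per-subject contrast values

def sign_flip_pvalue(x, n_perm=10000, rng=rng):
    """Two-sided p-value under H0: distribution symmetric about zero."""
    observed = abs(x.mean())
    flips = rng.choice([-1.0, 1.0], size=(n_perm, x.size))
    null = np.abs((flips * x).mean(axis=1))
    return (1 + np.sum(null >= observed)) / (n_perm + 1)

print(sign_flip_pvalue(contrasts))
```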

2,946 citations


MonographDOI
01 Jan 2016
TL;DR: This monograph covers nonparametric statistical models, function spaces and approximation theory, and the minimax paradigm, together with likelihood-based procedures and adaptive inference.
Abstract: 1. Nonparametric statistical models 2. Gaussian processes 3. Empirical processes 4. Function spaces and approximation theory 5. Linear nonparametric estimators 6. The minimax paradigm 7. Likelihood-based procedures 8. Adaptive inference.

534 citations


ReportDOI
TL;DR: In this article, a general construction of locally robust/orthogonal moment functions for GMM, where moment conditions have zero derivative with respect to first steps, is given, along with debiased machine learning estimators of functionals of high-dimensional conditional quantiles and of dynamic discrete choice parameters with high-dimensional state variables.
Abstract: Many economic and causal parameters depend on nonparametric or high dimensional first steps. We give a general construction of locally robust/orthogonal moment functions for GMM, where moment conditions have zero derivative with respect to first steps. We show that orthogonal moment functions can be constructed by adding to identifying moments the nonparametric influence function for the effect of the first step on identifying moments. Orthogonal moments reduce model selection and regularization bias, as is very important in many applications, especially for machine learning first steps. We give debiased machine learning estimators of functionals of high dimensional conditional quantiles and of dynamic discrete choice parameters with high dimensional state variables. We show that adding to identifying moments the nonparametric influence function provides a general construction of orthogonal moments, including regularity conditions, and show that the nonparametric influence function is robust to additional unknown functions on which it depends. We give a general approach to estimating the unknown functions in the nonparametric influence function and use it to automatically debias estimators of functionals of high dimensional conditional location learners. We give a variety of new doubly robust moment equations and characterize double robustness. We give general and simple regularity conditions and apply these for asymptotic inference on functionals of high dimensional regression quantiles and dynamic discrete choice parameters with high dimensional state variables.
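As a concrete illustration of the orthogonal-moment idea (not the paper's general construction), here is a hedged sketch of a cross-fitted, debiased estimator for a partially linear model, with random forests as the machine-learning first steps; the simulated data, forest settings, and variable names are assumptions for illustration.

```python
# Partially linear model Y = theta*D + g(X) + e. The residual-on-residual moment is
# Neyman-orthogonal: small errors in the first-step regressions do not bias theta.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(1)
n, p, theta = 2000, 10, 0.5                      # simulated data; theta is the target parameter
X = rng.normal(size=(n, p))
D = np.sin(X[:, 0]) + rng.normal(size=n)         # treatment with nonparametric confounding
Y = theta * D + np.cos(X[:, 0]) + rng.normal(size=n)

num, den = 0.0, 0.0
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    m_hat = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[train], D[train])
    l_hat = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[train], Y[train])
    V = D[test] - m_hat.predict(X[test])          # residualized treatment (cross-fitted)
    U = Y[test] - l_hat.predict(X[test])          # residualized outcome (cross-fitted)
    num += np.sum(V * U)
    den += np.sum(V * V)

print("debiased estimate of theta:", num / den)
```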

201 citations


Journal ArticleDOI
TL;DR: The basic concepts and practical use of nonparametric tests are discussed as a guide to their proper use.
Abstract: Conventional statistical tests are usually called parametric tests. Parametric tests are used more frequently than nonparametric tests in many medical articles because most medical researchers are familiar with them and statistical software packages strongly support them. Parametric tests require an important assumption, the assumption of normality, which means that the distribution of sample means is normally distributed. However, parametric tests can be misleading when this assumption is not satisfied. In this circumstance, nonparametric tests are the available alternative, because they do not require the normality assumption. Nonparametric tests are statistical methods based on signs and ranks. In this article, we discuss the basic concepts and practical use of nonparametric tests as a guide to their proper use.
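As a quick illustration of "methods based on signs and ranks", the following sketch runs a few standard rank-based tests from SciPy on hypothetical skewed data; the data and the particular choice of tests are illustrative only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
before = rng.lognormal(0.0, 1.0, size=25)        # skewed paired measurements
after = before * rng.lognormal(0.1, 0.3, size=25)
g1, g2, g3 = (rng.lognormal(m, 1.0, size=20) for m in (0.0, 0.2, 0.4))

print(stats.wilcoxon(before, after))             # signed-rank test (paired samples)
print(stats.kruskal(g1, g2, g3))                 # rank-based one-way ANOVA analogue
print(stats.spearmanr(before, after))            # rank correlation
```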

171 citations


Journal ArticleDOI
01 Jan 2016
TL;DR: Applications include testing independence by distance covariance, goodness‐of‐fit, nonparametric tests for equality of distributions and extension of analysis of variance, generalizations of clustering algorithms, change point analysis, feature selection, and more.
Abstract: Energy distance is a metric that measures the distance between the distributions of random vectors. Energy distance is zero if and only if the distributions are identical, thus it characterizes equality of distributions and provides a theoretical foundation for statistical inference and analysis. Energy statistics are functions of distances between observations in metric spaces. As a statistic, energy distance can be applied to measure the difference between a sample and a hypothesized distribution or the difference between two or more samples in arbitrary, not necessarily equal dimensions. The name energy is inspired by the close analogy with Newton's gravitational potential energy. Applications include testing independence by distance covariance, goodness-of-fit, nonparametric tests for equality of distributions and extension of analysis of variance, generalizations of clustering algorithms, change point analysis, feature selection, and more. WIREs Comput Stat 2016, 8:27-38. doi: 10.1002/wics.1375
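The two-sample energy statistic is simple enough to implement directly from the definition above; the following hedged sketch computes it from pairwise Euclidean distances and attaches a permutation p-value (the nm/(n+m) scaling and the permutation scheme are standard choices assumed here for illustration, and the data are simulated).

```python
import numpy as np
from scipy.spatial.distance import cdist

def energy_statistic(x, y):
    """Two-sample energy test statistic: (nm/(n+m)) * (2*E|X-Y| - E|X-X'| - E|Y-Y'|)."""
    n, m = len(x), len(y)
    exy = cdist(x, y).mean()
    exx = cdist(x, x).mean()
    eyy = cdist(y, y).mean()
    return (n * m / (n + m)) * (2 * exy - exx - eyy)

def energy_perm_test(x, y, n_perm=999, seed=0):
    """Permutation p-value for equality of the two multivariate distributions."""
    rng = np.random.default_rng(seed)
    pooled, n = np.vstack([x, y]), len(x)
    observed = energy_statistic(x, y)
    null = []
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        null.append(energy_statistic(pooled[idx[:n]], pooled[idx[n:]]))
    return observed, (1 + np.sum(np.array(null) >= observed)) / (n_perm + 1)

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1, size=(60, 3))
y = rng.normal(0.4, 1, size=(80, 3))
print(energy_perm_test(x, y))
```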

149 citations


Journal ArticleDOI
TL;DR: It is shown that one can evade the curse of dimensionality by assuming a simplified vine copula model for the dependence between variables; a general nonparametric estimator is formulated and, under high-level assumptions, its speed of convergence is shown to be independent of dimension.

137 citations


Journal ArticleDOI
TL;DR: In this paper, a statistical method for postprocessing ensembles based on quantile regression forests (QRF), a generalization of random forests for quantile regression, is proposed.
Abstract: Ensembles used for probabilistic weather forecasting tend to be biased and underdispersive. This paper proposes a statistical method for postprocessing ensembles based on quantile regression forests (QRF), a generalization of random forests for quantile regression. This method does not fit a parametric probability density function (PDF) like in ensemble model output statistics (EMOS) but provides an estimation of desired quantiles. This is a nonparametric approach that eliminates any assumption on the variable subject to calibration. This method can estimate quantiles using not only members of the ensemble but any predictor available, including statistics on other variables. The method is applied to the Meteo-France 35-member ensemble forecast (PEARP) for surface temperature and wind speed for available lead times from 3 up to 54 h and compared to EMOS. All postprocessed ensembles are much better calibrated than the PEARP raw ensemble, and experiments on real data also show that QRF performs better than EMOS.
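To make the QRF idea concrete, here is a rough sketch (not the Meteo-France/PEARP implementation) that fits an ordinary random forest on simulated ensemble-style predictors and reads predictive quantiles from the leaf-weighted empirical distribution of training targets instead of the leaf means; the data, forest settings, and helper name are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
X = rng.normal(size=(3000, 5))                                  # e.g. ensemble mean, spread, other predictors
y = X[:, 0] + 0.5 * np.abs(X[:, 1]) * rng.normal(size=3000)     # heteroscedastic observations

forest = RandomForestRegressor(n_estimators=100, min_samples_leaf=20, random_state=0).fit(X, y)

def qrf_quantiles(forest, X_train, y_train, x_new, probs=(0.1, 0.5, 0.9)):
    """Quantiles of the leaf-weighted empirical distribution of training targets."""
    leaves_train = forest.apply(X_train)                        # (n_train, n_trees) leaf indices
    leaves_new = forest.apply(x_new.reshape(1, -1))[0]
    weights = np.zeros(len(y_train))
    for t in range(leaves_train.shape[1]):
        in_leaf = leaves_train[:, t] == leaves_new[t]
        weights[in_leaf] += 1.0 / in_leaf.sum()                 # equal weight within each tree's leaf
    weights /= leaves_train.shape[1]
    order = np.argsort(y_train)
    cdf = np.cumsum(weights[order])
    return [y_train[order][np.searchsorted(cdf, p)] for p in probs]

print(qrf_quantiles(forest, X, y, X[0]))
```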

120 citations



Book
17 Sep 2016
TL;DR: This book introduces advanced undergraduates, graduate students, and practitioners to statistical methods for ranking data and provides a novel and unifying approach to hypothesis testing.
Abstract: This book introduces advanced undergraduates, graduate students, and practitioners to statistical methods for ranking data. An important aspect of nonparametric statistics is oriented towards the use of ranking data. Rank correlation is defined through the notion of distance functions, and the notion of compatibility is introduced to deal with incomplete data. Ranking data are also modeled using a variety of modern tools such as CART, MCMC, the EM algorithm, and factor analysis. This book deals with statistical methods used for analyzing such data and provides a novel and unifying approach to hypothesis testing. The techniques described in the book are illustrated with examples, and the statistical software is provided on the authors' website.
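A tiny example of defining rank correlation through a distance between rankings, as the abstract describes: for complete rankings without ties, Kendall's tau equals 1 - 4D/(n(n-1)), where D is the number of discordant pairs (the Kendall distance). The rankings below are made up for illustration.

```python
import itertools
import numpy as np
from scipy import stats

def kendall_distance(r1, r2):
    """Number of item pairs ordered differently by the two rankings."""
    return sum(
        (r1[i] - r1[j]) * (r2[i] - r2[j]) < 0
        for i, j in itertools.combinations(range(len(r1)), 2)
    )

r1 = np.array([1, 2, 3, 4, 5])
r2 = np.array([2, 1, 3, 5, 4])
n, d = len(r1), kendall_distance(r1, r2)
print(1 - 4 * d / (n * (n - 1)), stats.kendalltau(r1, r2)[0])   # both give the same tau
```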

108 citations


Journal ArticleDOI
TL;DR: In this paper, a nonparametric test of random utility models is proposed to test the null hypothesis that a sample of cross-sectional demand distributions was generated by a population of rational consumers.
Abstract: This paper develops and implements a nonparametric test of Random Utility Models. The motivating application is to test the null hypothesis that a sample of cross-sectional demand distributions was generated by a population of rational consumers. We test a necessary and sufficient condition for this that does not rely on any restriction on unobserved heterogeneity or the number of goods. We also propose and implement a control function approach to account for endogenous expenditure. An econometric result of independent interest is a test for linear inequality constraints when these are represented as the vertices of a polyhedron rather than its faces. An empirical application to the U.K. Household Expenditure Survey illustrates computational feasibility of the method in demand problems with 5 goods.

104 citations


Journal ArticleDOI
TL;DR: In this paper, an updated non-stationary bias-correction method for a monthly global climate model of temperature and precipitation was developed, which combines two widely used quantile mapping bias correction methods to eliminate potential illogical values of the variable.
Abstract: We developed an updated nonstationary bias-correction method for a monthly global climate model of temperature and precipitation. The proposed method combines two widely used quantile mapping bias-correction methods to eliminate potential illogical values of the variable. Instead of empirical parameter estimation in the more-common quantile mapping method, our study compared bias-correction performance when parametric or nonparametric procedures were used to estimate the probability distribution. The results showed our proposed bias-correction method to be very effective in reducing the model bias: it removed over 80% and 83% of model bias for surface air temperature and precipitation, respectively, during the validation period. Compared with a widely used method of bias correction (delta change), our proposed technique demonstrates improved correction of the distribution of variables. In addition, nonparametric estimation procedures further reduced the mean absolute errors in temperature and precipitation during the validation period by approximately 2% and 0.4%, respectively, compared with parametric procedures. The proposed method can remove over 40% and 60% of the uncertainty from model temperature and precipitation projections, respectively, at the global land scale.
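As background for readers unfamiliar with quantile mapping, here is a minimal empirical quantile-mapping sketch; it shows only the basic building block that the paper's combined, nonstationary method extends, and the gamma-distributed example data and function name are invented for illustration.

```python
import numpy as np

def quantile_map(model_hist, obs_hist, model_future):
    """Correct model_future by the quantile relation between historical model and observations."""
    probs = np.linspace(0.01, 0.99, 99)
    model_q = np.quantile(model_hist, probs)
    obs_q = np.quantile(obs_hist, probs)
    # probability of each future value under the historical model distribution,
    # then read off the observed value at the same probability
    p_future = np.interp(model_future, model_q, probs)
    return np.interp(p_future, probs, obs_q)

rng = np.random.default_rng(5)
obs = rng.gamma(shape=2.0, scale=2.0, size=360)        # hypothetical observed monthly precipitation
model = rng.gamma(shape=2.0, scale=2.6, size=360)      # biased model climatology
future = rng.gamma(shape=2.0, scale=3.0, size=360)
print(quantile_map(model, obs, future)[:5])
```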

Journal ArticleDOI
TL;DR: An accurate parametric model called modified hyperbolic tangent (MHTan) is proposed to characterize the power curve of a wind turbine, and the results demonstrate the efficiency of the proposed model compared to some other existing parametric and nonparametric models.

Posted Content
TL;DR: In this article, the authors develop a generalized empirical likelihood framework based on distributional uncertainty sets constructed from nonparametric $f$-divergence balls for Hadamard differentiable functionals, and in particular, stochastic optimization problems.
Abstract: We study statistical inference and distributionally robust solution methods for stochastic optimization problems, focusing on confidence intervals for optimal values and solutions that achieve exact coverage asymptotically. We develop a generalized empirical likelihood framework---based on distributional uncertainty sets constructed from nonparametric $f$-divergence balls---for Hadamard differentiable functionals, and in particular, stochastic optimization problems. As consequences of this theory, we provide a principled method for choosing the size of distributional uncertainty regions to provide one- and two-sided confidence intervals that achieve exact coverage. We also give an asymptotic expansion for our distributionally robust formulation, showing how robustification regularizes problems by their variance. Finally, we show that optimizers of the distributionally robust formulations we study enjoy (essentially) the same consistency properties as those in classical sample average approximations. Our general approach applies to quickly mixing stationary sequences, including geometrically ergodic Harris recurrent Markov chains.

Journal ArticleDOI
TL;DR: The possibilities and challenges of introducing shape constraints through this device are explored and illustrated through simulations and two real data examples.
Abstract: Gaussian processes are a popular tool for nonparametric function estimation because of their flexibility and the fact that much of the ensuing computation is parametric Gaussian computation. Often,...

Journal ArticleDOI
TL;DR: It is shown that a nonparametric f-divergence measure can be used to provide improved bounds on the minimum binary classification probability of error for the case when the training and test data are drawn from the same distribution.
Abstract: Information divergence functions play a critical role in statistics and information theory. In this paper we show that a nonparametric $f$-divergence measure can be used to provide improved bounds on the minimum binary classification probability of error for the case when the training and test data are drawn from the same distribution and for the case where there exists some mismatch between training and test distributions. We confirm these theoretical results by designing feature selection algorithms using the criteria from these bounds and by evaluating the algorithms on a series of pathological speech classification tasks.
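One common nonparametric route to such divergence estimates is the Friedman-Rafsky construction: count the edges of a Euclidean minimal spanning tree of the pooled sample that join points from different samples. The sketch below computes that cross-edge count on simulated data; the final normalization into a divergence value is a Henze-Penrose-style formula assumed here for illustration and is not necessarily the paper's exact estimator.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mst_cross_count(x, y):
    """Edges of the pooled-sample MST whose endpoints come from different samples."""
    pooled = np.vstack([x, y])
    labels = np.r_[np.zeros(len(x)), np.ones(len(y))]
    mst = minimum_spanning_tree(squareform(pdist(pooled))).tocoo()
    return int(np.sum(labels[mst.row] != labels[mst.col]))

rng = np.random.default_rng(6)
x = rng.normal(0.0, 1.0, size=(200, 4))
y = rng.normal(1.0, 1.0, size=(200, 4))
m, n = len(x), len(y)
r = mst_cross_count(x, y)
divergence = 1.0 - r * (m + n) / (2.0 * m * n)   # ~0 for identical samples, ~1 for well-separated ones
print(r, divergence)
```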

Book ChapterDOI
01 Jan 2016
TL;DR: The Mann-Whitney U-Test has many appropriate uses and it should be considered when using ranked data, data that deviate from acceptable distribution patterns, or for when there are noticeable differences in the number of subjects in the two comparative groups as mentioned in this paper.
Abstract: The Mann–Whitney U test is often viewed as the nonparametric equivalent of Student’s t-Test for Independent Samples, but this comparison may be somewhat too convenient. The two tests (the nonparametric Mann–Whitney U-Test and the parametric Student’s t-Test for Independent Samples) may have similar purposes in that they are both used to determine if there are statistically significant differences between two groups. However, the Mann–Whitney U-Test is used with nonparametric data (typically, ordinal data) whereas the Student’s t-Test for Independent Samples is used with data that meet the assumptions associated with parametric distributions (typically interval data that approximate an acceptable level of normal distribution). Even so, the Mann–Whitney U-Test has many appropriate uses and it should be considered when using ranked data, data that deviate from acceptable distribution patterns, or for when there are noticeable differences in the number of subjects in the two comparative groups.
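A short usage example contrasting the two tests discussed here on hypothetical ordinal-style scores with unequal group sizes (the data are invented):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
group_a = rng.integers(1, 8, size=18)                        # e.g. 7-point ordinal ratings
group_b = np.clip(rng.integers(1, 8, size=35) + 1, 1, 7)     # larger group, shifted upward

print(stats.ttest_ind(group_a, group_b, equal_var=False))            # parametric comparison
print(stats.mannwhitneyu(group_a, group_b, alternative="two-sided")) # rank-based comparison
```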

Journal ArticleDOI
01 Feb 2016-PeerJ
TL;DR: In the two approaches, replacing the spurious singleton count by the estimated count can greatly remove the positive biases associated with diversity estimates due to spurious singletons and also make fair comparisons across microbial communities, as illustrated in the simulation results and in applying the method to analyze sequencing data from viral metagenomes.
Abstract: Estimating and comparing microbial diversity are statistically challenging due to limited sampling and possible sequencing errors for low-frequency counts, producing spurious singletons. The inflated singleton count seriously affects statistical analysis and inferences about microbial diversity. Previous statistical approaches to tackle the sequencing errors generally require different parametric assumptions about the sampling model or about the functional form of frequency counts. Different parametric assumptions may lead to drastically different diversity estimates. We focus on nonparametric methods which are universally valid for all parametric assumptions and can be used to compare diversity across communities. We develop here a nonparametric estimator of the true singleton count to replace the spurious singleton count in all methods/approaches. Our estimator of the true singleton count is in terms of the frequency counts of doubletons, tripletons and quadrupletons, provided these three frequency counts are reliable. To quantify microbial alpha diversity for an individual community, we adopt the measure of Hill numbers (effective number of taxa) under a nonparametric framework. Hill numbers, parameterized by an order q that determines the measures' emphasis on rare or common species, include taxa richness (q = 0), Shannon diversity (q = 1, the exponential of Shannon entropy), and Simpson diversity (q = 2, the inverse of Simpson index). A diversity profile which depicts the Hill number as a function of order q conveys all information contained in a taxa abundance distribution. Based on the estimated singleton count and the original non-singleton frequency counts, two statistical approaches (non-asymptotic and asymptotic) are developed to compare microbial diversity for multiple communities. (1) A non-asymptotic approach refers to the comparison of estimated diversities of standardized samples with a common finite sample size or sample completeness. This approach aims to compare diversity estimates for equally-large or equally-complete samples; it is based on the seamless rarefaction and extrapolation sampling curves of Hill numbers, specifically for q = 0, 1 and 2. (2) An asymptotic approach refers to the comparison of the estimated asymptotic diversity profiles. That is, this approach compares the estimated profiles for complete samples or samples whose size tends to be sufficiently large. It is based on statistical estimation of the true Hill number of any order q ≥ 0. In the two approaches, replacing the spurious singleton count by our estimated count, we can greatly remove the positive biases associated with diversity estimates due to spurious singletons and also make fair comparisons across microbial communities, as illustrated in our simulation results and in applying our method to analyze sequencing data from viral metagenomes.
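For reference, the Hill number of order q can be computed directly from its definition as quoted above (q = 0 richness, q = 1 the exponential of Shannon entropy, q = 2 the inverse Simpson index); the singleton-count correction, rarefaction, and extrapolation steps that are the paper's actual contribution are not shown, and the abundance vector is hypothetical.

```python
import numpy as np

def hill_number(counts, q):
    """Hill number of order q from taxon abundance counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    if np.isclose(q, 1.0):
        return np.exp(-np.sum(p * np.log(p)))     # limit of the formula as q -> 1
    return np.sum(p ** q) ** (1.0 / (1.0 - q))

abundances = [120, 60, 30, 15, 8, 4, 2, 1, 1, 1]  # hypothetical taxon counts
for q in (0, 1, 2):
    print(q, hill_number(abundances, q))
```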

Journal ArticleDOI
TL;DR: The different methods of testing are briefly described, and the resulting p-values of such tests are reported on datasets for four types of designs: between, within, mixed, and pretest-posttest.
Abstract: An increasing number of R packages include nonparametric tests for the interaction in two-way factorial designs. This paper briefly describes the different methods of testing and reports the resulting p-values of such tests on datasets for four types of designs: between, within, mixed, and pretest-posttest designs. Potential users are advised only to apply tests they are quite familiar with and not be guided by p-values for selecting packages and tests.

Journal ArticleDOI
TL;DR: In this article, a general introduction to Partial Least Squares SEM (PLS-SEM) is given, and a step-by-step procedure, along with R functions, is presented to estimate the model.
Abstract: Structural equation modeling (SEM) has become widespread in educational and psychological research. Its flexibility in addressing complex theoretical models and its proper treatment of measurement error have made it the model of choice for many researchers in the social sciences. Nevertheless, the model imposes some daunting assumptions and restrictions (e.g. normality and relatively large sample sizes) that could discourage practitioners from applying it. Partial least squares SEM (PLS-SEM) is a nonparametric technique which makes no distributional assumptions and can be estimated with small sample sizes. In this paper a general introduction to PLS-SEM is given, and the approach is compared with conventional SEM. Next, step-by-step procedures, along with R functions, are presented to estimate the model. A data set is analyzed and the outputs are interpreted.

Proceedings Article
01 Jan 2016
TL;DR: In this article, a classification algorithm for estimating posterior distributions from positive-unlabeled data, that is robust to noise in the positive labels and effective for high-dimensional data, is developed.
Abstract: We develop a classification algorithm for estimating posterior distributions from positive-unlabeled data, that is robust to noise in the positive labels and effective for high-dimensional data. In recent years, several algorithms have been proposed to learn from positive-unlabeled data; however, many of these contributions remain theoretical, performing poorly on real high-dimensional data that is typically contaminated with noise. We build on this previous work to develop two practical classification algorithms that explicitly model the noise in the positive labels and utilize univariate transforms built on discriminative classifiers. We prove that these univariate transforms preserve the class prior, enabling estimation in the univariate space and avoiding kernel density estimation for high-dimensional data. The theoretical development and parametric and nonparametric algorithms proposed here constitute an important step towards wide-spread use of robust classification algorithms for positive-unlabeled data.

Journal ArticleDOI
TL;DR: Quantitative analysis of zooarchaeological taxonomic abundances and skeletal part frequencies often relies on parametric techniques to test hypotheses; archaeologists need not abandon statistical inference, but if they use it, they should temper how they apply statistical tools.
Abstract: Quantitative analysis of zooarchaeological taxonomic abundances and skeletal part frequencies often relies on parametric techniques to test hypotheses. Data upon which such analyses are based are considered by some to be ‘ordinal scale at best’, meaning that non-parametric approaches may be better suited for addressing hypotheses. An important consideration is that archaeologists do not directly or randomly sample target populations of artefacts and faunal remains, which means that sampling error is not randomly generated. Thus, use of inferential statistics is potentially suspect. A solution to this problem is to rely on a weight of evidence research strategy and to limit analysis to descriptive statistics. Alternatively, if one chooses to use statistical inference, one should analyse effect size to determine practical significance of results and adopt conservative, robust inferential tests that require relatively few assumptions. Archaeologists may choose not to abandon statistical inference, but if so, they should temper how they use statistical tools. Copyright © 2014 John Wiley & Sons, Ltd.

Journal ArticleDOI
TL;DR: A new nonparametric methodology is developed for monitoring location parameters when only a small reference dataset is available; the key idea is to construct a series of conditionally distribution-free test statistics, in the sense that their distributions are free of the underlying distribution given the empirical distribution functions.
Abstract: Monitoring multivariate quality variables or data streams remains an important and challenging problem in statistical process control (SPC). Although the multivariate SPC has been extensively studied in the literature, designing distribution-free control schemes are still challenging and yet to be addressed well. This article develops a new nonparametric methodology for monitoring location parameters when only a small reference dataset is available. The key idea is to construct a series of conditionally distribution-free test statistics in the sense that their distributions are free of the underlying distribution given the empirical distribution functions. The conditional probability that the charting statistic exceeds the control limit at present given that there is no alarm before the current time point can be guaranteed to attain a specified false alarm rate. The success of the proposed method lies in the use of data-dependent control limits, which are determined based on the observations online rather...

Journal ArticleDOI
TL;DR: A new input variable selection method based on CMI, which uses a nonparametric multivariate continuous probability estimator based on Edgeworth approximations (EA), is introduced, and the superior performance of broCMI is demonstrated when compared to CMI-based alternatives.
Abstract: The input variable selection problem has recently garnered much interest in the time series modeling community, especially within water resources applications, demonstrating that information theoretic (nonlinear)-based input variable selection algorithms such as partial mutual information (PMI) selection (PMIS) provide an improved representation of the modeled process when compared to linear alternatives such as partial correlation input selection (PCIS). PMIS is a popular algorithm for water resources modeling problems considering nonlinear input variable selection; however, this method requires the specification of two nonlinear regression models, each with parametric settings that greatly influence the selected input variables. Other attempts to develop input variable selection methods using conditional mutual information (CMI) (an analog to PMI) have been formulated under different parametric pretenses such as k nearest-neighbor (KNN) statistics or kernel density estimates (KDE). In this paper, we introduce a new input variable selection method based on CMI that uses a nonparametric multivariate continuous probability estimator based on Edgeworth approximations (EA). We improve the EA method by considering the uncertainty in the input variable selection procedure by introducing a bootstrap resampling procedure that uses rank statistics to order the selected input sets; we name our proposed method bootstrap rank-ordered CMI (broCMI). We demonstrate the superior performance of broCMI when compared to CMI-based alternatives (EA, KDE, and KNN), PMIS, and PCIS input variable selection algorithms on a set of seven synthetic test problems and a real-world urban water demand (UWD) forecasting experiment in Ottawa, Canada.

Journal ArticleDOI
TL;DR: This work surveys the methods of ISC group analysis that have been employed in the literature, and proposes less computationally intensive nonparametric methods that can be performed at the group level (for both one- and two-sample analyses), as compared to the popular method of circularly shifting the EPI time series at the individual level.

Journal ArticleDOI
TL;DR: This work develops a coherent methodology for the construction of bootstrap prediction intervals for time series that can be modeled as linear, nonlinear, or nonparametric autoregressions, and presents detailed algorithms for these different models.
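As a minimal instance of the kind of interval this methodology targets, here is a residual-bootstrap sketch of prediction intervals for a fitted AR(1); the simulated series, the plain forward-resampling scheme, and the omission of refinements such as predictive roots are all simplifying assumptions, not the authors' algorithms.

```python
import numpy as np

rng = np.random.default_rng(8)
n, phi = 300, 0.7
y = np.zeros(n)
for t in range(1, n):                                  # simulate an AR(1) series
    y[t] = phi * y[t - 1] + rng.normal()

phi_hat = np.sum(y[1:] * y[:-1]) / np.sum(y[:-1] ** 2) # least-squares AR(1) fit
resid = y[1:] - phi_hat * y[:-1]
resid -= resid.mean()                                  # center the residuals

h, B = 5, 2000                                         # horizon and bootstrap replicates
paths = np.empty((B, h))
for b in range(B):
    y_cur = y[-1]
    for j in range(h):
        y_cur = phi_hat * y_cur + rng.choice(resid)    # resample residuals forward in time
        paths[b, j] = y_cur

lower, upper = np.quantile(paths, [0.05, 0.95], axis=0)
print(np.c_[lower, upper])                             # 90% prediction intervals, horizons 1..5
```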

Proceedings Article
02 May 2016
TL;DR: In this paper, the authors show that unconverged stochastic gradient descent can be interpreted as a procedure that samples from a nonparametric approximate posterior distribution, implicitly defined by the transformation of an initial distribution by a sequence of optimization steps.
Abstract: We show that unconverged stochastic gradient descent can be interpreted as a procedure that samples from a nonparametric approximate posterior distribution. This distribution is implicitly defined by the transformation of an initial distribution by a sequence of optimization steps. By tracking the change in entropy over these distributions during optimization, we form a scalable, unbiased estimate of a variational lower bound on the log marginal likelihood. This bound can be used to optimize hyperparameters instead of cross-validation. This Bayesian interpretation of SGD suggests improved, overfitting-resistant optimization procedures, and gives a theoretical foundation for early stopping and ensembling. We investigate the properties of this marginal likelihood estimator on neural network models.

Journal ArticleDOI
TL;DR: In this article, a kernel density estimate (KDE) of the joint distribution of $Y$ and $X$ is proposed for modal regression, and asymptotic error bounds for this method are derived.
Abstract: Modal regression estimates the local modes of the distribution of $Y$ given $X=x$, instead of the mean, as in the usual regression sense, and can hence reveal important structure missed by usual regression methods. We study a simple nonparametric method for modal regression, based on a kernel density estimate (KDE) of the joint distribution of $Y$ and $X$. We derive asymptotic error bounds for this method, and propose techniques for constructing confidence sets and prediction sets. The latter is used to select the smoothing bandwidth of the underlying KDE. The idea behind modal regression is connected to many others, such as mixture regression and density ridge estimation, and we discuss these ties as well.
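The basic device behind KDE-based modal regression is a partial (conditional) mean-shift iteration that moves candidate y-values toward local modes of the estimated conditional density at a fixed x. The sketch below implements that iteration with Gaussian kernels; the bimodal toy data and the ad hoc bandwidths are assumptions, and the bandwidth selection via prediction sets described in the abstract is not shown.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 500
X = rng.uniform(-2, 2, size=n)
Y = np.where(rng.random(n) < 0.5, np.sin(X) + 1.5, np.sin(X) - 1.5) + 0.2 * rng.normal(size=n)

def conditional_modes(x0, X, Y, hx=0.3, hy=0.3, starts=20, iters=100):
    """Partial mean-shift: update y only, with x held fixed at x0."""
    y = np.linspace(Y.min(), Y.max(), starts)             # multiple starting points
    kx = np.exp(-0.5 * ((x0 - X) / hx) ** 2)              # x-kernel weights, fixed over iterations
    for _ in range(iters):
        w = kx * np.exp(-0.5 * ((y[:, None] - Y[None, :]) / hy) ** 2)
        y = (w @ Y) / w.sum(axis=1)                       # mean-shift update in the y-direction
    return np.unique(np.round(y, 2))                      # distinct converged modes

print(conditional_modes(0.0, X, Y))                       # expect modes near -1.5 and +1.5
```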

Journal ArticleDOI
TL;DR: In this article, a two-step estimation method of stochastic volatility models is proposed: in the first step, we nonparametrically estimate the (unobserved) instantaneous volatility process.
Abstract: A two-step estimation method of stochastic volatility models is proposed: In the first step, we nonparametrically estimate the (unobserved) instantaneous volatility process. In the second step, standard estimation methods for fully observed diffusion processes are employed, but with the filtered/estimated volatility process replacing the latent process. Our estimation strategy is applicable to both parametric and nonparametric stochastic volatility models, and can handle both jumps and market microstructure noise. The resulting estimators of the stochastic volatility model will carry additional biases and variances due to the first-step estimation, but under regularity conditions we show that these vanish asymptotically and our estimators inherit the asymptotic properties of the infeasible estimators based on observations of the volatility process. A simulation study examines the finite-sample properties of the proposed estimators.

Posted Content
TL;DR: This technical report revisits the analysis of family-wise error rates in statistical parametric mapping—using random field theory—reported by Eklund and colleagues (Eklund et al., 2015) and unpacks the implications of these results for parametric procedures.
Abstract: This technical report revisits the analysis of family-wise error rates in statistical parametric mapping - using random field theory - reported in (Eklund et al., 2015). Contrary to the understandable spin that these sorts of analyses attract, a review of their results suggests that they endorse the use of parametric assumptions - and random field theory - in the analysis of functional neuroimaging data. We briefly rehearse the advantages parametric analyses offer over nonparametric alternatives and then unpack the implications of (Eklund et al., 2015) for parametric procedures.

Journal ArticleDOI
TL;DR: Marsan and Lengliné's method is improved and assessed in the following ways: novel ways to incorporate a spatially inhomogeneous background rate are proposed, and error bars are added to the histogram estimates to quantify the sampling variability in the estimation of the underlying seismic process.
Abstract: Space–time Hawkes point process models for the conditional rate of earthquake occurrences traditionally make many parametric assumptions about the form of the triggering function for the rate of aftershocks following an earthquake. As an alternative, Marsan and Lengline [Science 319 (2008) 1076–1079] developed a completely nonparametric method that provides an estimate of a homogeneous background rate for mainshocks, and a histogram estimate of the triggering function. At each step of the procedure the model estimates rely on computing the probability each earthquake is a mainshock or aftershock of a previous event. The focus of this paper is the improvement and assessment of Marsan and Lengline’s method in the following ways: (a) the proposal of novel ways to incorporate a spatially inhomogeneous background rate; (b) adding error bars to the histogram estimates which quantify the sampling variability in the estimation of the underlying seismic process. A simulation study is designed to evaluate and validate the ability of our methods to recover the triggering function and spatially varying background rate. An application to earthquake data from the Tohoku District in Japan is discussed at the end, and the results are compared to a well-established parametric model of seismicity for this region.
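To convey the flavor of the completely nonparametric estimation being improved here, the following is a heavily simplified, temporal-only EM sketch in the spirit of Marsan and Lengliné's approach: alternate between computing the probability that each event is a background event or was triggered by an earlier event, and re-estimating a constant background rate and a histogram triggering function from those probabilities. Space, magnitudes, the inhomogeneous background, error bars, and edge corrections, all central to the paper, are omitted, and the catalog below is simulated noise rather than real seismicity.

```python
import numpy as np

rng = np.random.default_rng(10)
T = 1000.0
times = np.sort(rng.uniform(0, T, size=400))            # stand-in catalog of event times

bins = np.array([0.0, 1.0, 2.0, 5.0, 10.0, 20.0, 50.0]) # lag bins for the triggering histogram
widths = np.diff(bins)
g = np.full(len(widths), 0.05)                          # initial histogram heights
mu = len(times) / T                                     # initial background rate

dt = times[None, :] - times[:, None]                    # dt[i, j] = t_j - t_i
valid = dt > 0                                          # only earlier events can trigger later ones
bin_idx = np.digitize(dt, bins) - 1                     # lag bin of each (parent i, child j) pair

for _ in range(50):
    # E-step: unnormalized triggering weight of each admissible (i, j) pair
    in_range = valid & (bin_idx >= 0) & (bin_idx < len(widths))
    trig = np.where(in_range, g[np.clip(bin_idx, 0, len(widths) - 1)], 0.0)
    total = mu + trig.sum(axis=0)                       # total rate attributed to each event j
    p_bg = mu / total                                   # probability event j is a background event
    p_trig = trig / total                               # probability event i triggered event j
    # M-step: update background rate and histogram heights from expected counts
    mu = p_bg.sum() / T
    for b in range(len(widths)):
        g[b] = p_trig[(bin_idx == b) & valid].sum() / (len(times) * widths[b])

print("background rate:", mu)
print("triggering histogram:", g)
```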