
Showing papers on "Resampling published in 2017"


Journal ArticleDOI
TL;DR: It is shown that permutations of the raw observational (or ‘pre‐network’) data consistently account for underlying structure in the generated social network, and thus can reduce both type I and type II error rates.
Abstract: Null models are an important component of the social network analysis toolbox. However, their use in hypothesis testing is still not widespread. Furthermore, several different approaches for constructing null models exist, each with their relative strengths and weaknesses, and often testing different hypotheses. In this study, I highlight why null models are important for robust hypothesis testing in studies of animal social networks. Using simulated data containing a known observation bias, I test how different statistical tests and null models perform if such a bias was unknown. I show that permutations of the raw observational (or 'pre-network') data consistently account for underlying structure in the generated social network, and thus can reduce both type I and type II error rates. However, permutations of pre-network data remain relatively uncommon in animal social network analysis because they are challenging to implement for certain data types, particularly those from focal follows and GPS tracking. I explain simple routines that can easily be implemented across different types of data, and supply R code that applies each type of null model to the same simulated dataset. The R code can easily be modified to test hypotheses with empirical data. Widespread use of pre-network data permutation methods will benefit researchers by facilitating robust hypothesis testing.
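As an illustration only (the paper's accompanying R code is not reproduced here), the following Python sketch shows one widely used pre-network permutation: checkerboard swaps of a binary group-by-individual matrix, which preserve group sizes and individual sighting frequencies while breaking associations. All data and variable names are hypothetical.

import numpy as np

def swap_permutation(gbi, n_swaps=1000, rng=None):
    """Permute a binary group-by-individual matrix via checkerboard swaps.

    Each accepted swap exchanges a 2x2 submatrix [[1, 0], [0, 1]] with
    [[0, 1], [1, 0]], preserving row sums (group sizes) and column sums
    (how often each individual was observed).
    """
    rng = np.random.default_rng(rng)
    gbi = gbi.copy()
    n_groups, n_inds = gbi.shape
    done, attempts = 0, 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        r = rng.choice(n_groups, size=2, replace=False)
        c = rng.choice(n_inds, size=2, replace=False)
        sub = gbi[np.ix_(r, c)]
        if sub[0, 0] == sub[1, 1] and sub[0, 1] == sub[1, 0] and sub[0, 0] != sub[0, 1]:
            gbi[np.ix_(r, c)] = sub[::-1]   # flip the checkerboard
            done += 1
    return gbi

def simple_ratio_index(gbi):
    """Simple ratio association index: joint sightings / groups containing either individual."""
    x = gbi.T @ gbi
    seen = gbi.sum(axis=0)
    denom = seen[:, None] + seen[None, :] - x
    with np.errstate(divide="ignore", invalid="ignore"):
        sri = np.where(denom > 0, x / denom, 0.0)
    np.fill_diagonal(sri, 0.0)
    return sri

# Null distribution of a network statistic (here, mean association strength)
rng = np.random.default_rng(1)
gbi = (rng.random((60, 20)) < 0.3).astype(int)          # toy observation data
observed = simple_ratio_index(gbi).mean()
null = np.array([simple_ratio_index(swap_permutation(gbi, 500, rng)).mean()
                 for _ in range(200)])
p_upper = np.mean(null >= observed)                      # one-sided permutation p-value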

312 citations


Journal ArticleDOI
TL;DR: The present study compared the nonparametric bootstrap test with pooled resampling method to the corresponding parametric, nonparametric, and permutation tests through extensive simulations under various conditions and using real data examples, with the aim of overcoming the problems associated with small samples in hypothesis testing.
Abstract: Experimental studies in biomedical research frequently pose analytical problems related to small sample size. In such studies, there are conflicting findings regarding the choice of parametric and nonparametric analysis, especially with non-normal data. In such instances, some methodologists questioned the validity of parametric tests and suggested nonparametric tests. In contrast, other methodologists found nonparametric tests to be too conservative and less powerful and thus preferred using parametric tests. Some researchers have recommended using a bootstrap test; however, this method also has limitations with small sample sizes. We used a pooled method in the nonparametric bootstrap test that may overcome the problems related to small samples in hypothesis testing. The present study compared the nonparametric bootstrap test with pooled resampling method to the corresponding parametric, nonparametric, and permutation tests through extensive simulations under various conditions and using real data examples. The nonparametric pooled bootstrap t-test provided equal or greater power for comparing two means as compared with the unpaired t-test, Welch t-test, Wilcoxon rank sum test, and permutation test, while maintaining the type I error probability for all conditions except Cauchy and extreme variable lognormal distributions. In such cases, we suggest using an exact Wilcoxon rank sum test. The nonparametric bootstrap paired t-test also provided better performance than other alternatives. The nonparametric bootstrap test provided a benefit over the exact Kruskal-Wallis test. We suggest using the nonparametric bootstrap test with pooled resampling method for comparing paired or unpaired means and for validating one-way analysis of variance test results for non-normal data in small sample size studies. Copyright © 2017 John Wiley & Sons, Ltd.
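As a sketch of the pooled-resampling idea (not the authors' implementation), the example below pools the two samples under the null hypothesis of equal means, draws bootstrap samples of the original group sizes from the pool, and compares the observed Welch t-statistic with its bootstrap null distribution. Sample sizes and distributions are placeholders.

import numpy as np
from scipy import stats

def pooled_bootstrap_t_test(x, y, n_boot=10_000, rng=None):
    """Two-sample bootstrap t-test with pooled resampling under H0."""
    rng = np.random.default_rng(rng)
    x, y = np.asarray(x, float), np.asarray(y, float)
    t_obs = stats.ttest_ind(x, y, equal_var=False).statistic
    pool = np.concatenate([x, y])            # pooling imposes the null of equal means
    t_null = np.empty(n_boot)
    for b in range(n_boot):
        xb = rng.choice(pool, size=x.size, replace=True)
        yb = rng.choice(pool, size=y.size, replace=True)
        t_null[b] = stats.ttest_ind(xb, yb, equal_var=False).statistic
    return np.mean(np.abs(t_null) >= np.abs(t_obs))   # two-sided bootstrap p-value

# Small, skewed samples of the kind the paper targets
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=8)
y = rng.lognormal(mean=0.5, sigma=1.0, size=8)
print(pooled_bootstrap_t_test(x, y, rng=rng))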

152 citations


Journal ArticleDOI
TL;DR: Five theoretical requirements that a generic effective sample size (ESS) function should satisfy are listed, allowing different ESS measures to be classified; several examples are provided involving, for instance, the geometric mean of the weights, the discrete entropy and the Gini coefficient.
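For context, a few alternative ESS functions for a vector of normalized importance weights can be sketched as follows; these are standard formulas from the ESS literature and not necessarily the exact measures analysed in the paper.

import numpy as np

def ess_inverse_squares(w):
    """Classical ESS: 1 / sum(w_i^2) for normalized weights."""
    w = np.asarray(w, float)
    w = w / w.sum()
    return 1.0 / np.sum(w ** 2)

def ess_perplexity(w):
    """Entropy-based ESS (perplexity): exp(-sum w_i log w_i)."""
    w = np.asarray(w, float)
    w = w / w.sum()
    nz = w[w > 0]
    return np.exp(-np.sum(nz * np.log(nz)))

def ess_max_weight(w):
    """Maximum-weight-based ESS: 1 / max(w_i)."""
    w = np.asarray(w, float)
    w = w / w.sum()
    return 1.0 / w.max()

# Uniform weights give ESS = N; a single dominant weight gives ESS close to 1
uniform = np.ones(100)
degenerate = np.r_[np.full(99, 1e-6), 1.0]
for f in (ess_inverse_squares, ess_perplexity, ess_max_weight):
    print(f.__name__, round(f(uniform), 2), round(f(degenerate), 2))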

120 citations


Proceedings ArticleDOI
21 Jul 2017
TL;DR: Two methods to detect and localize image manipulations based on a combination of resampling features and deep learning are proposed, both effective in detecting and localizing digital image forgeries.
Abstract: Resampling is an important signature of manipulated images. In this paper, we propose two methods to detect and localize image manipulations based on a combination of resampling features and deep learning. In the first method, the Radon transform of resampling features are computed on overlapping image patches. Deep learning classifiers and a Gaussian conditional random field model are then used to create a heatmap. Tampered regions are located using a Random Walker segmentation method. In the second method, resampling features computed on overlapping image patches are passed through a Long short-term memory (LSTM) based network for classification and localization. We compare the performance of detection/localization of both these methods. Our experimental results show that both techniques are effective in detecting and localizing digital image forgeries.

107 citations


Journal ArticleDOI
TL;DR: In this article, independent components are estimated by combining a nonparametric probability integral transformation with a generalized non-parametric whitening method based on distance covariance that simultaneously minimizes all forms of dependence among the components.
Abstract: This article introduces a novel statistical framework for independent component analysis (ICA) of multivariate data. We propose methodology for estimating mutually independent components, and a versatile resampling-based procedure for inference, including misspecification testing. Independent components are estimated by combining a nonparametric probability integral transformation with a generalized nonparametric whitening method based on distance covariance that simultaneously minimizes all forms of dependence among the components. We prove the consistency of our estimator under minimal regularity conditions and detail conditions for consistency under model misspecification, all while placing assumptions on the observations directly, not on the latent components. U statistics of certain Euclidean distances between sample elements are combined to construct a test statistic for mutually independent components. The proposed measures and tests are based on both necessary and sufficient conditions for...

100 citations


Journal ArticleDOI
24 Jul 2017-PLOS ONE
TL;DR: This work proposes a bootstrap method based on probability integral transform (PIT-) residuals, called the PIT-trap, which assumes data come from some marginal distribution F of known parametric form; simulations demonstrate that it has improved properties compared with competing resampling methods.
Abstract: Bootstrap methods are widely used in statistics, and bootstrapping of residuals can be especially useful in the regression context. However, difficulties are encountered extending residual resampling to regression settings where residuals are not identically distributed (thus not amenable to bootstrapping)—common examples including logistic or Poisson regression and generalizations to handle clustered or multivariate data, such as generalised estimating equations. We propose a bootstrap method based on probability integral transform (PIT-) residuals, which we call the PIT-trap, which assumes data come from some marginal distribution F of known parametric form. This method can be understood as a type of “model-free bootstrap”, adapted to the problem of discrete and highly multivariate data. PIT-residuals have the key property that they are (asymptotically) pivotal. The PIT-trap thus inherits the key property, not afforded by any other residual resampling approach, that the marginal distribution of data can be preserved under PIT-trapping. This in turn enables the derivation of some standard bootstrap properties, including second-order correctness of pivotal PIT-trap test statistics. In multivariate data, bootstrapping rows of PIT-residuals affords the property that it preserves correlation in data without the need for it to be modelled, a key point of difference as compared to a parametric bootstrap. The proposed method is illustrated on an example involving multivariate abundance data in ecology, and demonstrated via simulation to have improved properties as compared to competing resampling methods.
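The PIT-trap itself handles multivariate data and is implemented in the authors' software; the sketch below only illustrates the basic building block for a univariate Poisson regression: compute randomized PIT residuals from the fitted model, resample them, and map them back through the fitted distributions to create bootstrap responses. All names and settings are illustrative.

import numpy as np
import statsmodels.api as sm
from scipy.stats import poisson

rng = np.random.default_rng(0)

# Toy Poisson regression data
n = 200
x = rng.normal(size=n)
X = sm.add_constant(x)
y = rng.poisson(np.exp(0.5 + 0.8 * x))

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
mu = fit.fittedvalues

# Randomized PIT residuals: uniform on [F(y-1; mu), F(y; mu)]
u = rng.uniform(poisson.cdf(y - 1, mu), poisson.cdf(y, mu))

n_boot = 500
slopes = np.empty(n_boot)
for b in range(n_boot):
    u_star = np.clip(rng.choice(u, size=n, replace=True), 1e-12, 1 - 1e-12)
    y_star = poisson.ppf(u_star, mu).astype(int)    # invert through the fitted model
    slopes[b] = sm.GLM(y_star, X, family=sm.families.Poisson()).fit().params[1]

print("bootstrap standard error of the slope:", slopes.std(ddof=1))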

76 citations


Journal ArticleDOI
TL;DR: The automated SIR procedure was successfully applied over a large variety of cases, and its user-friendly implementation in the PsN program enables an efficient estimation of parameter uncertainty in NLMEM.
Abstract: Quantifying the uncertainty around endpoints used for decision-making in drug development is essential. In nonlinear mixed-effects models (NLMEM) analysis, this uncertainty is derived from the uncertainty around model parameters. Different methods to assess parameter uncertainty exist, but scrutiny towards their adequacy is low. In a previous publication, sampling importance resampling (SIR) was proposed as a fast and assumption-light method for the estimation of parameter uncertainty. A non-iterative implementation of SIR proved adequate for a set of simple NLMEM, but the choice of SIR settings remained an issue. This issue was alleviated in the present work through the development of an automated, iterative SIR procedure. The new procedure was tested on 25 real data examples covering a wide range of pharmacokinetic and pharmacodynamic NLMEM featuring continuous and categorical endpoints, with up to 39 estimated parameters and varying data richness. SIR led to appropriate results after 3 iterations on average. SIR was also compared with the covariance matrix, bootstrap and stochastic simulations and estimations (SSE). SIR was about 10 times faster than the bootstrap. SIR led to relative standard errors similar to the covariance matrix and SSE. SIR parameter 95% confidence intervals also displayed similar asymmetry to SSE. In conclusion, the automated SIR procedure was successfully applied over a large variety of cases, and its user-friendly implementation in the PsN program enables an efficient estimation of parameter uncertainty in NLMEM.
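Outside the PsN implementation, the core (non-iterative) SIR step can be sketched as follows, assuming a proposal built from the estimated parameter vector and covariance matrix and a user-supplied log-likelihood function; everything here, including the toy normal model, is illustrative rather than the paper's procedure.

import numpy as np

def sir(theta_hat, cov, log_likelihood, n_samples=5000, n_resamples=1000, rng=None):
    """Sampling importance resampling for parameter uncertainty:
    sample candidates from a multivariate normal proposal, weight them by
    target/proposal density ratios, then resample without replacement."""
    rng = np.random.default_rng(rng)
    d = len(theta_hat)
    samples = rng.multivariate_normal(theta_hat, cov, size=n_samples)

    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    diff = samples - theta_hat
    log_q = -0.5 * (np.einsum("ij,jk,ik->i", diff, inv, diff)
                    + logdet + d * np.log(2 * np.pi))
    log_p = np.array([log_likelihood(t) for t in samples])

    log_w = log_p - log_q
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    idx = rng.choice(n_samples, size=n_resamples, replace=False, p=w)
    return samples[idx]

# Toy usage: mean and log-SD of a normal sample (flat prior, so target = likelihood)
data = np.random.default_rng(1).normal(1.0, 2.0, size=30)

def log_likelihood(theta):
    mean, log_sd = theta
    sd = np.exp(log_sd)
    return np.sum(-0.5 * ((data - mean) / sd) ** 2 - np.log(sd))

theta_hat = np.array([data.mean(), np.log(data.std(ddof=1))])
cov0 = np.diag([data.var(ddof=1) / len(data), 0.5 / len(data)])
draws = sir(theta_hat, cov0, log_likelihood, rng=np.random.default_rng(2))
print(np.percentile(draws, [2.5, 97.5], axis=0))   # 95% parameter uncertainty intervals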

74 citations


Posted Content
TL;DR: In this paper, a combination of resampling features and deep learning is used to detect and localize image manipulations, based on the Radon transform of resampling features computed on overlapping image patches and on a long short-term memory (LSTM) based network for classification and localization.
Abstract: Resampling is an important signature of manipulated images. In this paper, we propose two methods to detect and localize image manipulations based on a combination of resampling features and deep learning. In the first method, the Radon transform of resampling features are computed on overlapping image patches. Deep learning classifiers and a Gaussian conditional random field model are then used to create a heatmap. Tampered regions are located using a Random Walker segmentation method. In the second method, resampling features computed on overlapping image patches are passed through a Long short-term memory (LSTM) based network for classification and localization. We compare the performance of detection/localization of both these methods. Our experimental results show that both techniques are effective in detecting and localizing digital image forgeries.

74 citations


Journal ArticleDOI
TL;DR: In this article, the authors proposed three new schemes that considerably improve the performance of the original PMC formulation by allowing for better exploration of the space of unknowns and by selecting more adequately the surviving samples.

68 citations


Journal ArticleDOI
TL;DR: A novel resampling technique focused on proper detection of minority examples in a two-class imbalanced data task is described and results indicate that the proposed algorithm usually outperforms the conventional oversampling approaches, especially when the detection of Minority examples is considered.
Abstract: Imbalanced data classification is one of the most widespread challenges in contemporary pattern recognition. Varying levels of imbalance may be observed in most real datasets, affecting the performance of classification algorithms. In particular, high levels of imbalance pose serious difficulties, often requiring the use of specially designed methods. In such cases the most important issue is often to properly detect minority examples, but at the same time the performance on the majority class cannot be neglected. In this paper we describe a novel resampling technique focused on proper detection of minority examples in a two-class imbalanced data task. The proposed method combines cleaning the decision border around minority objects with guided synthetic oversampling. Results of the conducted experimental study indicate that the proposed algorithm usually outperforms the conventional oversampling approaches, especially when the detection of minority examples is considered.
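The paper's combined border-cleaning and guided oversampling algorithm is not reproduced here; for orientation, the sketch below shows the conventional SMOTE-style synthetic oversampling that such methods are typically compared against, with all names and data illustrative.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_oversample(X_min, n_new, k=5, rng=None):
    """Generate synthetic minority samples by interpolating between each
    minority point and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(rng)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)                     # idx[:, 0] is the point itself
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        j = rng.integers(len(X_min))                  # pick a minority point
        neighbour = X_min[rng.choice(idx[j, 1:])]     # one of its neighbours
        gap = rng.random()
        synthetic[i] = X_min[j] + gap * (neighbour - X_min[j])
    return synthetic

# Balance a toy two-class imbalanced dataset
rng = np.random.default_rng(0)
X_maj = rng.normal(0.0, 1.0, size=(500, 2))
X_min = rng.normal(2.0, 1.0, size=(25, 2))
X_new = smote_like_oversample(X_min, n_new=len(X_maj) - len(X_min), rng=rng)
X_bal = np.vstack([X_maj, X_min, X_new])
y_bal = np.r_[np.zeros(len(X_maj)), np.ones(len(X_min) + len(X_new))]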

66 citations


Journal ArticleDOI
TL;DR: In this article, it was shown that studentizing the sample correlation leads to a permutation test which is exact under independence and asymptotically controls the probability of Type 1 (or Type 3) errors.
Abstract: Given a sample from a bivariate distribution, consider the problem of testing independence. A permutation test based on the sample correlation is known to be an exact level α test. However, when used to test the null hypothesis that the samples are uncorrelated, the permutation test can have rejection probability that is far from the nominal level. Further, the permutation test can have a large Type 3 (directional) error rate, whereby there can be a large probability that the permutation test rejects because the sample correlation is a large positive value, when in fact the true correlation is negative. It will be shown that studentizing the sample correlation leads to a permutation test which is exact under independence and asymptotically controls the probability of Type 1 (or Type 3) errors. These conclusions are based on our results describing the almost sure limiting behavior of the randomization distribution. We will also present asymptotically robust randomization tests for regression coeffi...
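A sketch of the contrast described above, assuming a bivariate sample in arrays x and y: the naive permutation test uses the raw sample correlation, while the studentized version divides by a moment-based estimate of its variance (one standard choice, shown here for illustration; see the paper for the exact statistic and theory).

import numpy as np

def studentized_corr_stat(x, y):
    """Sample correlation studentized by a moment-based variance estimate."""
    n = len(x)
    xc, yc = x - x.mean(), y - y.mean()
    r = np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))
    # estimate of the asymptotic variance of sqrt(n) * r under independence
    tau2 = np.mean(xc ** 2 * yc ** 2) / (np.mean(xc ** 2) * np.mean(yc ** 2))
    return np.sqrt(n) * r / np.sqrt(tau2)

def permutation_test(x, y, stat, n_perm=5000, rng=None):
    """Two-sided permutation p-value obtained by permuting y."""
    rng = np.random.default_rng(rng)
    obs = stat(x, y)
    perm = np.array([stat(x, rng.permutation(y)) for _ in range(n_perm)])
    return np.mean(np.abs(perm) >= np.abs(obs))

# Uncorrelated but dependent data: over repeated simulations the raw-correlation
# permutation test can over-reject, whereas the studentized version is asymptotically valid
rng = np.random.default_rng(0)
z = rng.standard_normal(300)
x = z * rng.standard_normal(300)
y = z * rng.standard_normal(300)
raw_stat = lambda a, b: np.corrcoef(a, b)[0, 1]
print("raw:", permutation_test(x, y, raw_stat, rng=rng))
print("studentized:", permutation_test(x, y, studentized_corr_stat, rng=rng))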

Journal ArticleDOI
TL;DR: The results indicate that all implemented filters improve the estimation of water storage simulations of W3RA, and the best results are obtained using two versions of deterministic EnKF, i.e. the Square Root Analysis scheme and the Ensemble Square Root Filter.

Proceedings ArticleDOI
05 Mar 2017
TL;DR: The results of these experiments show that the proposed constrained convolutional neural network can accurately detect resampling in re-compressed images in scenarios that previous approaches are unable to detect.
Abstract: Detecting image resampling in re-compressed images is a very challenging problem. Existing approaches to image resampling detection operate by building pre-selected model to locate periodicities in linear predictor residues. Additionally, if an image was JPEG compressed before resampling, existing techniques detect tampering using the artifacts left by the pre-compression. However, state-of-the-art approaches cannot detect resampling in re-compressed images initially compressed with high quality factor. In this paper, we propose a novel deep learning approach to adaptively learn resampling detection features directly from data. To accomplish this, we use our recently proposed constrained convolutional layer. Through a set of experiments we evaluate the effectiveness of our proposed constrained convolutional neural network (CNN) to detect resampling in re-compressed images. The results of these experiments show that our constrained CNN can accurately detect resampling in re-compressed images in scenarios that previous approaches are unable to detect.

Journal ArticleDOI
TL;DR: It is shown that the central score stabilizes very quickly but that intersample variability shrinks after 10-15 experts, while the standard error of the scores continues to decrease as sample size increases, and that bootstrapping methods only reduce the estimated standard errors for small samples.

Proceedings ArticleDOI
TL;DR: The impact of resampling on classification accuracy is investigated, methods used to resample the dataset are compared, and key points and difficulties of resampling are highlighted, along with the importance of accurate prediction of the minority class.
Abstract: In many real-world binary classification tasks (e.g. detection of certain objects from images), an available dataset is imbalanced, i.e., it has far fewer representatives of one class (the minority class) than of another. Generally, accurate prediction of the minority class is crucial but hard to achieve, since there is not much information about the minority class. One approach to deal with this problem is to preliminarily resample the dataset, i.e., add new elements to the dataset or remove existing ones. Resampling can be done in various ways, which raises the problem of choosing the most appropriate one. In this paper we experimentally investigate the impact of resampling on classification accuracy, compare resampling methods, and highlight key points and difficulties of resampling.

Journal ArticleDOI
TL;DR: In this article, a wild bootstrap procedure for cluster-robust inference in linear quantile regression models is proposed, which is easy to implement and performs well even when the number of clusters is much smaller than the sample size.
Abstract: In this article I develop a wild bootstrap procedure for cluster-robust inference in linear quantile regression models. I show that the bootstrap leads to asymptotically valid inference on the entire quantile regression process in a setting with a large number of small, heterogeneous clusters and provides consistent estimates of the asymptotic covariance function of that process. The proposed bootstrap procedure is easy to implement and performs well even when the number of clusters is much smaller than the sample size. An application to Project STAR data is provided. Supplementary materials for this article are available online.

Journal ArticleDOI
TL;DR: iMap4 is a freely available MATLAB open source toolbox for the statistical fixation mapping of eye movement data, with a user-friendly interface providing straightforward, easy-to-interpret statistical graphical outputs.
Abstract: A major challenge in modern eye movement research is to statistically map where observers are looking, by isolating the significant differences between groups and conditions. As compared to the signals from contemporary neuroscience measures, such as magneto/electroencephalography and functional magnetic resonance imaging, eye movement data are sparser, with much larger variations in space across trials and participants. As a result, the implementation of a conventional linear modeling approach on two-dimensional fixation distributions often returns unstable estimations and underpowered results, leaving this statistical problem unresolved (Liversedge, Gilchrist, & Everling, 2011). Here, we present a new version of the iMap toolbox (Caldara & Miellet, 2011) that tackles this issue by implementing a statistical framework comparable to those developed in state-of-the-art neuroimaging data-processing toolboxes. iMap4 uses univariate, pixel-wise linear mixed models on smoothed fixation data, with the flexibility of coding for multiple between- and within-subjects comparisons and performing all possible linear contrasts for the fixed effects (main effects, interactions, etc.). Importantly, we also introduced novel nonparametric tests based on resampling, to assess statistical significance. Finally, we validated this approach by using both experimental and Monte Carlo simulation data. iMap4 is a freely available MATLAB open source toolbox for the statistical fixation mapping of eye movement data, with a user-friendly interface providing straightforward, easy-to-interpret statistical graphical outputs. iMap4 matches the standards of robust statistical neuroimaging methods and represents an important step in the data-driven processing of eye movement fixation data, an important field of vision sciences.

Journal ArticleDOI
TL;DR: In this paper, a new methodology for sequential state and parameter estimation within the ensemble Kalman filter is proposed, which is fully Bayesian and propagates the joint posterior distribution of states and parameters over time.
Abstract: This paper proposes new methodology for sequential state and parameter estimation within the ensemble Kalman filter. The method is fully Bayesian and propagates the joint posterior distribution of states and parameters over time. To implement the method, the authors consider three representations of the marginal posterior distribution of the parameters: a grid-based approach, a Gaussian approximation, and a sequential importance sampling (SIR) approach with kernel resampling. In contrast to existing online parameter estimation algorithms, the new method explicitly accounts for parameter uncertainty and provides a formal way to combine information about the parameters from data at different time periods. The method is illustrated and compared to existing approaches using simulated and real data.

Posted Content
TL;DR: SpectRes, as discussed by the authors, is a Python tool for resampling spectral flux densities and their associated uncertainties onto different wavelength grids; it works with any grid of wavelength values, including non-uniform sampling, and preserves the integrated flux.
Abstract: I present a fast Python tool, SpectRes, for carrying out the resampling of spectral flux densities and their associated uncertainties onto different wavelength grids. The function works with any grid of wavelength values, including non-uniform sampling, and preserves the integrated flux. This may be of use for binning data to increase the signal to noise ratio, obtaining synthetic photometry, or resampling model spectra to match the sampling of observed data for spectral energy distribution fitting. The function can be downloaded from this https URL.
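SpectRes itself also propagates uncertainties and handles partial bin overlap at the edges; the sketch below only shows the core flux-conserving idea under a piecewise-constant assumption (integrate the flux density cumulatively, then difference at the new bin edges). Function and variable names are illustrative, not the SpectRes API.

import numpy as np

def rebin_flux_conserving(new_wavs, old_wavs, flux):
    """Resample a spectrum onto new wavelength centres while preserving the
    integrated flux, assuming constant flux density within each old bin."""
    def edges(centres):
        mid = 0.5 * (centres[1:] + centres[:-1])
        return np.concatenate([[2 * centres[0] - mid[0]], mid, [2 * centres[-1] - mid[-1]]])

    old_edges, new_edges = edges(old_wavs), edges(new_wavs)
    # cumulative integral of flux density over wavelength
    cum = np.concatenate([[0.0], np.cumsum(flux * np.diff(old_edges))])
    cum_at_new = np.interp(new_edges, old_edges, cum)
    # mean flux density in each new bin = integrated flux in bin / bin width
    return np.diff(cum_at_new) / np.diff(new_edges)

# Downsample a toy spectrum onto a coarser, non-uniform (logarithmic) grid
old_wavs = np.linspace(4000.0, 7000.0, 3001)
flux = 1e-17 * (1.0 + 0.3 * np.sin(old_wavs / 200.0))
new_wavs = np.geomspace(4100.0, 6900.0, 300)
new_flux = rebin_flux_conserving(new_wavs, old_wavs, flux)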

Journal ArticleDOI
12 Jan 2017-PeerJ
TL;DR: This work investigated the performance of regression-based methods that use generalized linear models (GLM) with statistical inference via site-based resampling, a promising alternative, along with approaches that mimicked the pmax test using GLM instead of the fourth-corner statistic.
Abstract: Statistical testing of trait-environment association from data is a challenge as there is no common unit of observation: the trait is observed on species, the environment on sites and the mediating abundance on species-site combinations. A number of correlation-based methods, such as the community weighted trait means method (CWM), the fourth-corner correlation method and the multivariate method RLQ, have been proposed to estimate such trait-environment associations. In these methods, valid statistical testing proceeds by performing two separate resampling tests, one site-based and the other species-based and by assessing significance by the largest of the two p-values (the pmax test). Recently, regression-based methods using generalized linear models (GLM) have been proposed as a promising alternative with statistical inference via site-based resampling. We investigated the performance of this new approach along with approaches that mimicked the pmax test using GLM instead of fourth-corner. By simulation using models with additional random variation in the species response to the environment, the site-based resampling tests using GLM are shown to have severely inflated type I error, of up to 90%, when the nominal level is set as 5%. In addition, predictive modelling of such data using site-based cross-validation very often identified trait-environment interactions that had no predictive value. The problem that we identify is not an "omitted variable bias" problem as it occurs even when the additional random variation is independent of the observed trait and environment data. Instead, it is a problem of ignoring a random effect. In the same simulations, the GLM-based pmax test controlled the type I error in all models proposed so far in this context, but still gave slightly inflated error in more complex models that included both missing (but important) traits and missing (but important) environmental variables. For screening the importance of single trait-environment combinations, the fourth-corner test is shown to give almost the same results as the GLM-based tests in far less computing time.

Journal ArticleDOI
TL;DR: An improved confidence interval for the average annual percent change in trend analysis is considered, which is based on a weighted average of the regression slopes in the segmented line regression model with unknown change points.
Abstract: This paper considers an improved confidence interval for the average annual percent change in trend analysis, which is based on a weighted average of the regression slopes in the segmented line regression model with unknown change points. The performance of the improved confidence interval proposed by Muggeo is examined for various distribution settings, and two new methods are proposed for further improvement. The first method is practically equivalent to the one proposed by Muggeo, but its construction is simpler, and it is modified to use the t-distribution instead of the standard normal distribution. The second method is based on the empirical distribution of the residuals and the resampling using a uniform random sample, and its satisfactory performance is indicated by a simulation study. Copyright © 2017 John Wiley & Sons, Ltd.

Journal ArticleDOI
TL;DR: In this paper, a modification to the PERMANOVA test statistic, coupled with either permutation or bootstrap resampling methods, is proposed as a solution to the BFP for dissimilarity-based tests.
Abstract: Summary The essence of the generalised multivariate Behrens–Fisher problem (BFP) is how to test the null hypothesis of equality of mean vectors for two or more populations when their dispersion matrices differ. Solutions to the BFP usually assume variables are multivariate normal and do not handle high-dimensional data. In ecology, species' count data are often high-dimensional, non-normal and heterogeneous. Also, interest lies in analysing compositional dissimilarities among whole communities in non-Euclidean (semi-metric or non-metric) multivariate space. Hence, dissimilarity-based tests by permutation (e.g., PERMANOVA, ANOSIM) are used to detect differences among groups of multivariate samples. Such tests are not robust, however, to heterogeneity of dispersions in the space of the chosen dissimilarity measure, most conspicuously for unbalanced designs. Here, we propose a modification to the PERMANOVA test statistic, coupled with either permutation or bootstrap resampling methods, as a solution to the BFP for dissimilarity-based tests. Empirical simulations demonstrate that the type I error remains close to nominal significance levels under classical scenarios known to cause problems for the un-modified test. Furthermore, the permutation approach is found to be more powerful than the (more conservative) bootstrap for detecting changes in community structure for real ecological datasets. The utility of the approach is shown through analysis of 809 species of benthic soft-sediment invertebrates from 101 sites in five areas spanning 1960 km along the Norwegian continental shelf, based on the Jaccard dissimilarity measure.

Journal Article
TL;DR: An extensive empirical evaluation of resampling procedures for SVM hyperparameter selection concludes that a 2-fold procedure is appropriate to select the hyperparameters of an SVM for data sets with 1000 or more datapoints, while a 3-fold procedure is appropriate for smaller data sets.
Abstract: Tuning the regularisation and kernel hyperparameters is a vital step in optimising the generalisation performance of kernel methods, such as the support vector machine (SVM). This is most often performed by minimising a resampling/cross-validation based model selection criterion; however, there seems to be little practical guidance on the most suitable form of resampling. This paper presents the results of an extensive empirical evaluation of resampling procedures for SVM hyperparameter selection, designed to address this gap in the machine learning literature. We tested 15 different resampling procedures on 121 binary classification data sets in order to select the best SVM hyperparameters. We used three very different statistical procedures to analyse the results: the standard multi-classifier/multidata set procedure proposed by Demsar, the confidence intervals on the excess loss of each procedure in relation to 5-fold cross validation, and the Bayes factor analysis proposed by Barber. We conclude that a 2-fold procedure is appropriate to select the hyperparameters of an SVM for data sets with 1000 or more datapoints, while a 3-fold procedure is appropriate for smaller data sets.
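A minimal scikit-learn sketch of the kind of resampling-based model selection evaluated in the paper: k-fold cross-validation over a grid of C and gamma for an RBF SVM. The dataset, grid, and fold counts are placeholders rather than the paper's experimental setup.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Placeholder binary classification data
X, y = make_classification(n_samples=1500, n_features=20, random_state=0)

param_grid = {"C": 10.0 ** np.arange(-2, 4),
              "gamma": 10.0 ** np.arange(-4, 1)}

# Following the paper's conclusion: 2-fold CV for >= 1000 points, 3-fold for smaller sets
k = 2 if len(X) >= 1000 else 3
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=k, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)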

Journal ArticleDOI
TL;DR: In this paper, the authors proposed asymptotically valid inference methods for matching estimators based on the weighted bootstrap, where the key is to construct bootstrap counterparts by resampling based on certain linear forms of the estimators.
Abstract: It is known that the naive bootstrap is not asymptotically valid for a matching estimator of the average treatment effect with a fixed number of matches. In this article, we propose asymptotically valid inference methods for matching estimators based on the weighted bootstrap. The key is to construct bootstrap counterparts by resampling based on certain linear forms of the estimators. Our weighted bootstrap is applicable for the matching estimators of both the average treatment effect and its counterpart for the treated population. Also, by incorporating a bias correction method in Abadie and Imbens (2011), our method can be asymptotically valid even for matching based on a vector of covariates. A simulation study indicates that the weighted bootstrap method is favorably comparable with the asymptotic normal approximation. As an empirical illustration, we apply the proposed method to the National Supported Work data. Supplementary materials for this article are available online.

Journal ArticleDOI
28 Feb 2017-Glossa
TL;DR: The goals of the current study are to provide a fuller picture of the status of acceptability judgment data in syntax, and to provide detailed information that syntacticians can use to design and evaluate the sensitivity of acceptability judgment experiments in their own research.
Abstract: Previous investigations into the validity of acceptability judgment data have focused almost exclusively on type I errors (or false positives) because of the consequences of such errors for syntactic theories (Sprouse & Almeida 2012; Sprouse et al. 2013). The current study complements these previous studies by systematically investigating the type II error rate (false negatives), or equivalently, the statistical power, of a wide cross-section of possible acceptability judgment experiments. Though type II errors have historically been assumed to be less costly than type I errors, the dynamics of scientific publishing mean that high type II error rates (i.e., studies with low statistical power) can lead to increases in type I error rates in a given field of study. We present a set of experiments and resampling simulations to estimate statistical power for four tasks (forced-choice, Likert scale, magnitude estimation, and yes-no), 50 effect sizes instantiated by real phenomena, sample sizes from 5 to 100 participants, and two approaches to statistical analysis (null hypothesis and Bayesian). Our goals are twofold (i) to provide a fuller picture of the status of acceptability judgment data in syntax, and (ii) to provide detailed information that syntacticians can use to design and evaluate the sensitivity of acceptability judgment experiments in their own research.
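As an illustration of the resampling-simulation idea (not the study's actual materials or scripts), the sketch below estimates statistical power at several sample sizes by repeatedly resampling participants, with replacement, from per-participant scores in two conditions and recording how often a paired test rejects; the generated scores merely stand in for real judgment data.

import numpy as np
from scipy import stats

def resampling_power(cond_a, cond_b, sample_size, n_sim=1000, alpha=0.05, rng=None):
    """Estimate power by resampling participants with replacement and
    applying a paired Wilcoxon signed-rank test at each simulated sample size."""
    rng = np.random.default_rng(rng)
    rejections = 0
    for _ in range(n_sim):
        idx = rng.integers(len(cond_a), size=sample_size)   # resampled participants
        p = stats.wilcoxon(cond_a[idx], cond_b[idx]).pvalue
        rejections += p < alpha
    return rejections / n_sim

# Stand-in for per-participant mean ratings of two sentence types
rng = np.random.default_rng(0)
cond_a = rng.normal(4.5, 1.0, size=100)
cond_b = rng.normal(4.0, 1.0, size=100)
for n in (10, 20, 40, 80):
    print(n, resampling_power(cond_a, cond_b, n, rng=rng))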

Journal ArticleDOI
TL;DR: Since an interval of plausible resampling factors can be inferred from the position of the gap, it is empirically demonstrated that using the resulting range as the search space of existing estimators yields better estimation accuracy than their standalone versions.
Abstract: The forensic analysis of resampling traces in upscaled images is addressed via subspace decomposition and random matrix theory principles. In this context, we derive the asymptotic eigenvalue distribution of sample autocorrelation matrices corresponding to genuine and upscaled images. To achieve this, we model genuine images as an autoregressive random field and we characterize upscaled images as a noisy version of a lower dimensional signal. Following the intuition behind Marcenko-Pastur law, we show that for upscaled images, the gap between the eigenvalues corresponding to the low-dimensional signal and the ones from the background noise can be enhanced by extracting a small number of consecutive columns/rows from the matrix of observations. In addition, using bounds provided by the same law for the eigenvalues of the noise space, we propose a detector for exposing traces of resampling. Finally, since an interval of plausible resampling factors can be inferred from the position of the gap, we empirically demonstrate that by using the resulting range as the search space of existing estimators (based on different principles), a better estimation accuracy can be attained with respect to the standalone versions of the latter.

Journal ArticleDOI
TL;DR: A subsampling algorithm called nonsingular subsampling is presented, which generates only nonsingular subsamples and is based on a modified LU decomposition algorithm that combines sample generation with solving the least squares problem.
Abstract: Simple random subsampling is an integral part of S estimation algorithms for linear regression. Subsamples are required to be nonsingular. Usually, discarding a singular subsample and drawing a new one leads to a sufficient number of nonsingular subsamples with a reasonable computational effort. However, this procedure can require so many subsamples that it becomes infeasible, especially if levels of categorical variables have low frequency. A subsampling algorithm called nonsingular subsampling is presented, which generates only nonsingular subsamples. When no singular subsamples occur, nonsingular subsampling is as fast as the simple algorithm, and if singular subsamples do occur, it maintains the same computational order. The algorithm works consistently, unless the full design matrix is singular. The method is based on a modified LU decomposition algorithm that combines sample generation with solving the least squares problem. The algorithm may also be useful for ordinary bootstrapping. Since the method allows for S estimation in designs with factors and interactions between factors and continuous regressors, we study properties of the resulting estimators, both in the sense of their dependence on the randomness of the sampling and of their statistical performance.
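The modified LU decomposition algorithm itself is not reproduced here; for orientation, the sketch below shows the naive rejection scheme that nonsingular subsampling improves upon: draw a random subsample and accept it only if its rows of the design matrix have full column rank. It also illustrates why rare factor levels make rejection expensive.

import numpy as np

def nonsingular_subsample(X, size=None, max_tries=10_000, rng=None):
    """Draw a random row subsample of X whose submatrix has full column rank.
    Naive rejection version: redraw until a nonsingular subsample is found."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    size = p if size is None else size              # elemental subsamples use p rows
    for _ in range(max_tries):
        idx = rng.choice(n, size=size, replace=False)
        if np.linalg.matrix_rank(X[idx]) == p:      # accept only nonsingular draws
            return idx
    raise RuntimeError("no nonsingular subsample found; design may be singular")

# Design with a low-frequency dummy variable: most random draws are singular
rng = np.random.default_rng(0)
n = 200
rare_level = (rng.random(n) < 0.03).astype(float)   # rare factor level
X = np.column_stack([np.ones(n), rng.normal(size=n), rare_level])
idx = nonsingular_subsample(X, rng=rng)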

Journal ArticleDOI
TL;DR: Results show a significant increase in predictive accuracy on rare cases associated with using resampling strategies, and the use of biased strategies further increases accuracy over non-biased strategies.
Abstract: Time series forecasting is a challenging task, where the non-stationary characteristics of data portray a hard setting for predictive tasks. A common issue is the imbalanced distribution of the target variable, where some values are very important to the user but severely under-represented. Standard prediction tools focus on the average behaviour of the data. However, the objective is the opposite in many forecasting tasks involving time series: predicting rare values. A common solution to forecasting tasks with imbalanced data is the use of resampling strategies, which operate on the learning data by changing its distribution in favour of a given bias. The objective of this paper is to provide solutions capable of significantly improving the predictive accuracy on rare cases in forecasting tasks using imbalanced time series data. We extend the application of resampling strategies to the time series context and introduce the concept of temporal and relevance bias in the case selection process of such strategies, presenting new proposals. We evaluate the results of standard forecasting tools and the use of resampling strategies, with and without bias over 24 time series data sets from six different sources. Results show a significant increase in predictive accuracy on rare cases associated with using resampling strategies, and the use of biased strategies further increases accuracy over non-biased strategies.

Posted Content
TL;DR: In this paper, the authors study convergence and convergence rates for resampling schemes and show that a new resampling algorithm based on a stochastic rounding technique converges regardless of the order of the input samples.
Abstract: We study convergence and convergence rates for resampling schemes. Our first main result is a general consistency theorem based on the notion of negative association, which is applied to establish the almost-sure weak convergence of measures output from Kitagawa's (1996) stratified resampling method. Carpenter et al's (1999) systematic resampling method is similar in structure but can fail to converge depending on the order of the input samples. We introduce a new resampling algorithm based on a stochastic rounding technique of Srinivasan (2001), which shares some attractive properties of systematic resampling, but which exhibits negative association and therefore converges irrespective of the order of the input samples. We confirm a conjecture made by Kitagawa (1996) that ordering input samples by their states in $\mathbb{R}$ yields a faster rate of convergence; we establish that when particles are ordered using the Hilbert curve in $\mathbb{R}^d$, the variance of the resampling error is ${\scriptscriptstyle\mathcal{O}}(N^{-(1+1/d)})$ under mild conditions, where $N$ is the number of particles. We use these results to establish asymptotic properties of particle algorithms based on resampling schemes that differ from multinomial resampling.
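A sketch of the two schemes discussed, following the standard textbook formulations (not the paper's notation): both partition [0, 1) into N strata and invert the cumulative weights, stratified resampling drawing an independent uniform in each stratum and systematic resampling sharing a single uniform offset.

import numpy as np

def _cumulative(weights):
    c = np.cumsum(weights)
    c[-1] = 1.0                       # guard against floating-point round-off
    return c

def stratified_resample(weights, rng=None):
    """Stratified resampling: one uniform draw per stratum [i/N, (i+1)/N)."""
    rng = np.random.default_rng(rng)
    n = len(weights)
    u = (np.arange(n) + rng.random(n)) / n
    return np.searchsorted(_cumulative(weights), u)

def systematic_resample(weights, rng=None):
    """Systematic resampling: a single uniform offset shared by all strata."""
    rng = np.random.default_rng(rng)
    n = len(weights)
    u = (np.arange(n) + rng.random()) / n
    return np.searchsorted(_cumulative(weights), u)

# Resample particle indices in proportion to their normalized weights
rng = np.random.default_rng(0)
w = rng.random(1000)
w /= w.sum()
idx_stratified = stratified_resample(w, rng)
idx_systematic = systematic_resample(w, rng)

The paper's ordering results concern this step: sorting particles by their states (or along a Hilbert curve in higher dimensions) before applying such schemes reduces the variance of the resampling error.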

Journal ArticleDOI
TL;DR: In this article, a non-parametric method is applied to quantify residual uncertainty in hydrologic streamflow forecasting, which acts as a post-processor on deterministic model forecasts and generates a residual uncertainty distribution.
Abstract: A non-parametric method is applied to quantify residual uncertainty in hydrologic streamflow forecasting. This method acts as a post-processor on deterministic model forecasts and generates a residual uncertainty distribution. Based on instance-based learning, it uses a k nearest-neighbour search for similar historical hydrometeorological conditions to determine uncertainty intervals from a set of historical errors, i.e. discrepancies between past forecast and observation. The performance of this method is assessed using test cases of hydrologic forecasting in two UK rivers: the Severn and Brue. Forecasts in retrospect were made and their uncertainties were estimated using kNN resampling and two alternative uncertainty estimators: quantile regression (QR) and uncertainty estimation based on local errors and clustering (UNEEC). Results show that kNN uncertainty estimation produces accurate and narrow uncertainty intervals with good probability coverage. Analysis also shows that the performance of this technique depends on the choice of search space. Nevertheless, the accuracy and reliability of uncertainty intervals generated using kNN resampling are at least comparable to those produced by QR and UNEEC. It is concluded that kNN uncertainty estimation is an interesting alternative to other post-processors, like QR and UNEEC, for estimating forecast uncertainty. Apart from its concept being simple and well understood, an advantage of this method is that it is relatively easy to implement.
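A sketch of the kNN residual-resampling idea under simplifying assumptions: historical forecast errors are stored together with the hydrometeorological predictors describing each situation, the k most similar historical situations to the current one are retrieved, and empirical quantiles of their errors form the uncertainty interval around the deterministic forecast. All names and the toy data are illustrative.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_uncertainty_interval(predictors_hist, errors_hist, predictors_now,
                             forecast_now, k=50, levels=(0.05, 0.95)):
    """Uncertainty interval for a deterministic forecast based on the empirical
    quantiles of errors made in the k most similar historical situations."""
    nn = NearestNeighbors(n_neighbors=k).fit(predictors_hist)
    _, idx = nn.kneighbors(np.atleast_2d(predictors_now))
    neighbour_errors = errors_hist[idx[0]]
    lo, hi = np.quantile(neighbour_errors, levels)
    return forecast_now + lo, forecast_now + hi

# Toy example: predictors could be recent rainfall and current flow
rng = np.random.default_rng(0)
predictors_hist = rng.normal(size=(2000, 2))
errors_hist = rng.normal(scale=1.0 + np.abs(predictors_hist[:, 0]))   # heteroscedastic errors
lower, upper = knn_uncertainty_interval(predictors_hist, errors_hist,
                                        predictors_now=[1.5, 0.2], forecast_now=42.0)
print(lower, upper)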