
Showing papers on "Resampling published in 2019"


Journal ArticleDOI
TL;DR: In this article, the effects of spatial autocorrelation on hyperparameter tuning and performance estimation are assessed by comparing several widely used machine-learning algorithms, such as boosted regression trees (BRT), k-nearest neighbor (KNN), random forest (RF) and support vector machine (SVM), with traditional parametric algorithms, such as logistic regression (GLM), and semi-parametric ones, such as generalized additive models (GAM), in terms of predictive performance.

173 citations


Journal ArticleDOI
TL;DR: An inexpensive and reliable estimate of the uncertainty associated with the predictions of a machine-learning model of atomic and molecular properties is presented, based on resampling, with multiple models being generated based on subsampling of the same training data.
Abstract: We present a scheme to obtain an inexpensive and reliable estimate of the uncertainty associated with the predictions of a machine-learning model of atomic and molecular properties. The scheme is based on resampling, with multiple models being generated based on subsampling of the same training data. The accuracy of the uncertainty prediction can be benchmarked by maximum likelihood estimation, which can also be used to correct for correlations between resampled models and to improve the performance of the uncertainty estimation by a cross-validation procedure. In the case of sparse Gaussian Process Regression models, this resampled estimator can be evaluated at negligible cost. We demonstrate the reliability of these estimates for the prediction of molecular and materials energetics and for the estimation of nuclear chemical shieldings in molecular crystals. Extension to estimate the uncertainty in energy differences, forces, or other correlated predictions is straightforward. This method can be easily applied to other machine-learning schemes and will be beneficial to make data-driven predictions more reliable and to facilitate training-set optimization and active-learning strategies.

112 citations
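
As a rough illustration of the committee-of-subsamples idea described above, the sketch below trains several models on random subsamples of the training data and uses the spread of their predictions as an uncertainty estimate. It uses scikit-learn's GaussianProcessRegressor as a stand-in for the sparse GPR models of the paper; the data, kernel, committee size and subsample size are illustrative assumptions, and the maximum-likelihood calibration step is omitted.

```python
# Sketch: resampling-based uncertainty from a committee of subsampled models.
# Assumes scikit-learn and NumPy; the data and kernel are placeholders.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X_train = rng.uniform(-3, 3, size=(200, 1))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=200)
X_test = np.linspace(-3, 3, 50).reshape(-1, 1)

n_models, subsample = 8, 120           # committee size and subsample size (illustrative)
committee_predictions = []
for _ in range(n_models):
    idx = rng.choice(len(X_train), size=subsample, replace=False)
    model = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-2)
    model.fit(X_train[idx], y_train[idx])
    committee_predictions.append(model.predict(X_test))

committee_predictions = np.array(committee_predictions)
y_mean = committee_predictions.mean(axis=0)         # committee prediction
y_uncertainty = committee_predictions.std(axis=0)   # resampling-based error bar
```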


Journal ArticleDOI
TL;DR: This work proposes a test of independence of two multivariate random vectors, given a sample from the underlying population, based on the estimation of mutual information, whose decomposition into joint and marginal entropies facilitates the use of recently-developed efficient entropy estimators derived from nearest neighbour distances.
Abstract: We propose a test of independence of two multivariate random vectors, given a sample from the underlying population. Our approach, which we call MINT, is based on the estimation of mutual information, whose decomposition into joint and marginal entropies facilitates the use of recently-developed efficient entropy estimators derived from nearest neighbour distances. The proposed critical values, which may be obtained from simulation (in the case where one marginal is known) or resampling, guarantee that the test has nominal size, and we provide local power analyses, uniformly over classes of densities whose mutual information satisfies a lower bound. Our ideas may be extended to provide new goodness-of-fit tests of normal linear models based on assessing the independence of our vector of covariates and an appropriately defined notion of an error vector. The theory is supported by numerical studies on both simulated and real data.

83 citations
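
The resampling calibration mentioned above can be illustrated with a toy permutation test of independence between two random vectors. The sketch below deliberately uses a crude plug-in statistic (the squared Frobenius norm of the cross-correlation matrix) rather than the nearest-neighbour entropy estimators underlying MINT; the data and sample sizes are illustrative.

```python
# Toy permutation test of independence between two multivariate samples.
# The statistic is NOT the MINT mutual-information estimator; it is a simple
# cross-correlation statistic used only to illustrate the resampling step.
import numpy as np

def cross_corr_stat(x, y):
    """Squared Frobenius norm of the cross-correlation matrix of x and y."""
    xc = (x - x.mean(0)) / x.std(0)
    yc = (y - y.mean(0)) / y.std(0)
    c = xc.T @ yc / len(x)
    return float((c ** 2).sum())

rng = np.random.default_rng(1)
n = 300
x = rng.normal(size=(n, 2))
y = 0.5 * x + rng.normal(size=(n, 2))      # dependent example

observed = cross_corr_stat(x, y)
perm_stats = [cross_corr_stat(x, y[rng.permutation(n)]) for _ in range(999)]
p_value = (1 + sum(s >= observed for s in perm_stats)) / (1 + len(perm_stats))
print(f"permutation p-value: {p_value:.3f}")
```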


Journal ArticleDOI
04 Feb 2019
TL;DR: The adequacies of several proposed and two new statistical methods for recovering the causal structure of systems with feedback from synthetic BOLD time series are tested, and Fast Adjacency Skewness (FASK) and Two-Step, both of which exploit non-Gaussian features of the BOLD signal, are introduced.
Abstract: We test the adequacies of several proposed and two new statistical methods for recovering the causal structure of systems with feedback from synthetic BOLD time series. We compare an adaptation of the first correct method for recovering cyclic linear systems; Granger causal regression; a multivariate autoregressive model with a permutation test; the Group Iterative Multiple Model Estimation (GIMME) algorithm; the Ramsey et al. non-Gaussian methods; two non-Gaussian methods by Hyvarinen and Smith; a method due to Patel et al.; and the GlobalMIT algorithm. We introduce and also compare two new methods, Fast Adjacency Skewness (FASK) and Two-Step, both of which exploit non-Gaussian features of the BOLD signal. We give theoretical justifications for the latter two algorithms. Our test models include feedback structures with and without direct feedback (2-cycles), excitatory and inhibitory feedback, models using experimentally determined structural connectivities of macaques, and empirical human resting-state and task data. We find that averaged over all of our simulations, including those with 2-cycles, several of these methods have a better than 80% orientation precision (i.e., the probability that a directed edge is in the true structure given that a procedure estimates it to be so) and the two new methods also have better than 80% recall (probability of recovering an orientation in the true structure).

66 citations


Journal ArticleDOI
TL;DR: Newly developed statistical methods for the analysis of repeated measures designs and multivariate data that neither assume multivariate normality nor specific covariance matrices have been implemented in the freely available R-package MANOVA.RM.
Abstract: The numerical availability of statistical inference methods for a modern and robust analysis of longitudinal and multivariate data in factorial experiments is an essential element in research and education. While existing approaches that rely on specific distributional assumptions of the data (multivariate normality and/or characteristic covariance matrices) are implemented in statistical software packages, there is a need for user-friendly software that can be used for the analysis of data that do not fulfill the aforementioned assumptions and provide accurate p-value and confidence interval estimates. Therefore, newly developed statistical methods for the analysis of repeated measures designs and multivariate data that neither assume multivariate normality nor specific covariance matrices have been implemented in the freely available R-package MANOVA.RM. The package is equipped with a graphical user interface for plausible applications in academia and other educational purposes. Several motivating examples illustrate the application of the methods.

65 citations


Journal ArticleDOI
TL;DR: The experimental results obtained indicate that there were statistical differences between the prediction results with and without resampling methods when evaluated with the geometric-mean, recall, probability of false alarms, and balance performance measures, implying that resampling methods can help in defect classification but not defect prioritization.
Abstract: Software defect data sets are typically characterized by an unbalanced class distribution where the defective modules are fewer than the non-defective modules. Prediction performances of defect prediction models are detrimentally affected by the skewed distribution of the faulty minority modules in the data set since most algorithms assume both classes in the data set to be equally balanced. Resampling approaches address this concern by modifying the class distribution to balance the minority and majority class distribution. However, very little is known about the best distribution for attaining high performance, especially in a more practical scenario. There are still inconclusive results pertaining to the suitable ratio of defect and clean instances (Pfp), the statistical and practical impacts of resampling approaches on prediction performance, and the more stable resampling approach across several performance measures. To assess the impact of resampling approaches, we investigated the bias and effect of commonly used resampling approaches on prediction accuracy in software defect prediction. Analyses of six resampling approaches on 40 releases of 20 open-source projects across five performance measures and five imbalance rates were performed. The experimental results obtained indicate that there were statistical differences between the prediction results with and without resampling methods when evaluated with the geometric-mean, recall (pd), probability of false alarms (pf), and balance performance measures. However, resampling methods could not improve the AUC values across all prediction models, implying that resampling methods can help in defect classification but not defect prioritization. A stable Pfp rate was dependent on the performance measure used. Lower Pfp rates are required for lower pf values, while higher Pfp values are required for higher pd values. Random Under-Sampling and Borderline-SMOTE proved to be the more stable resampling methods across several performance measures among the studied resampling methods. The performance of resampling methods is dependent on the imbalance ratio, evaluation measure and, to some extent, the prediction model. Newer oversampling methods should aim at generating relevant and informative data samples and not just increasing the minority samples.

64 citations
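
For concreteness, here is a hedged sketch of the two resampling approaches the study found most stable, Random Under-Sampling and Borderline-SMOTE, applied with imbalanced-learn to an artificial imbalanced data set; the imbalance rate and data are illustrative, not the study's experimental setup.

```python
# Sketch: rebalancing an imbalanced "defect" data set with the two resampling
# approaches reported as most stable. Requires scikit-learn and imbalanced-learn.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import BorderlineSMOTE

# Artificial defect data: roughly 10% defective modules (illustrative imbalance rate).
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1],
                           random_state=42)
print("original:", Counter(y))

X_rus, y_rus = RandomUnderSampler(random_state=42).fit_resample(X, y)
print("random under-sampling:", Counter(y_rus))

X_bsm, y_bsm = BorderlineSMOTE(random_state=42).fit_resample(X, y)
print("Borderline-SMOTE:", Counter(y_bsm))
```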


Journal ArticleDOI
TL;DR: In this article, the authors combined multiple imputation and bootstrap to obtain confidence intervals of the mean difference in outcome for two independent treatment groups in healthcare cost-effectiveness analysis.
Abstract: In healthcare cost-effectiveness analysis, probability distributions are typically skewed and missing data are frequent. Bootstrap and multiple imputation are well-established resampling methods for handling skewed and missing data. However, it is not clear how these techniques should be combined. This paper addresses combining multiple imputation and bootstrap to obtain confidence intervals of the mean difference in outcome for two independent treatment groups. We assessed statistical validity and efficiency of 10 candidate methods and applied these methods to a clinical data set. Single imputation nested in the bootstrap percentile method (with added noise to reflect the uncertainty of the imputation) emerged as the method with the best statistical properties. However, this method can require extensive computation times, and the lack of standard software makes this method not accessible for a larger group of researchers. Using a standard unpaired t-test with standard multiple imputation without bootstrap appears to be a robust alternative with acceptable statistical performance for which standard multiple imputation software is available.

55 citations
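
A simplified sketch of the best-performing approach in this comparison, single imputation nested within the bootstrap percentile method, follows; the noisy mean imputation is a crude stand-in for whatever imputation model would be used in practice, and the data are simulated.

```python
# Sketch: single imputation nested inside a percentile bootstrap for the mean
# difference between two groups. The noisy mean imputation is a placeholder
# for a proper imputation model.
import numpy as np

def impute_with_noise(x, rng):
    """Replace NaNs by the observed mean plus noise reflecting imputation uncertainty."""
    observed = x[~np.isnan(x)]
    filled = x.copy()
    n_missing = np.isnan(x).sum()
    filled[np.isnan(x)] = rng.normal(observed.mean(), observed.std(ddof=1), n_missing)
    return filled

def bootstrap_ci(group_a, group_b, n_boot=2000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    diffs = []
    for _ in range(n_boot):
        a = rng.choice(group_a, size=len(group_a), replace=True)   # bootstrap resample
        b = rng.choice(group_b, size=len(group_b), replace=True)
        diffs.append(impute_with_noise(a, rng).mean() - impute_with_noise(b, rng).mean())
    return np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])

rng = np.random.default_rng(1)
costs_a = np.exp(rng.normal(7.0, 1.0, 150))    # skewed cost data (illustrative)
costs_b = np.exp(rng.normal(7.2, 1.0, 150))
costs_a[rng.random(150) < 0.2] = np.nan        # 20% missingness (illustrative)
costs_b[rng.random(150) < 0.2] = np.nan
print("95% CI for mean difference:", bootstrap_ci(costs_a, costs_b))
```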


Journal ArticleDOI
22 Mar 2019
TL;DR: This paper discusses criterion-based, step-wise selection procedures and resampling methods for model selection, with cross-validation providing the simplest and most generic means for computationally estimating all required entities.
Abstract: When performing a regression or classification analysis, one needs to specify a statistical model. This model should avoid the overfitting and underfitting of data, and achieve a low generalization error that characterizes its prediction performance. In order to identify such a model, one needs to decide which model to select from candidate model families based on performance evaluations. In this paper, we review the theoretical framework of model selection and model assessment, including error-complexity curves, the bias-variance tradeoff, and learning curves for evaluating statistical models. We discuss criterion-based, step-wise selection procedures and resampling methods for model selection, whereas cross-validation provides the most simple and generic means for computationally estimating all required entities. To make the theoretical concepts transparent, we present worked examples for linear regression models. However, our conceptual presentation is extensible to more general models, as well as classification problems.

54 citations
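
To make the cross-validation discussion concrete, here is a small sketch that selects the degree of a polynomial regression model by k-fold cross-validation; the data-generating process and candidate degrees are illustrative, not taken from the paper's worked examples.

```python
# Sketch: choosing model complexity (polynomial degree) by 5-fold cross-validation.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=(200, 1))
y = 1.0 - 2.0 * x[:, 0] + 0.5 * x[:, 0] ** 3 + rng.normal(0, 0.5, 200)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
for degree in range(1, 7):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    mse = -cross_val_score(model, x, y, cv=cv, scoring="neg_mean_squared_error").mean()
    print(f"degree {degree}: CV mean squared error = {mse:.3f}")
# The degree with the smallest cross-validated error estimates the best
# bias-variance tradeoff among the candidate model families.
```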


Journal ArticleDOI
TL;DR: A general consistency theorem based on the notion of negative association is applied to establish the almost-sure weak convergence of measures output from Kitagawa's (1996) stratified resampling method, and asymptotic properties of particle algorithms based on resampling schemes that differ from multinomial resampling are established.
Abstract: We study convergence and convergence rates for resampling schemes. Our first main result is a general consistency theorem based on the notion of negative association, which is applied to establish the almost sure weak convergence of measures output from Kitagawa’s [J. Comput. Graph. Statist. 5 (1996) 1–25] stratified resampling method. Carpenter, Clifford and Fearnhead’s [IEE Proc. Radar Sonar Navig. 146 (1999) 2–7] systematic resampling method is similar in structure but can fail to converge depending on the order of the input samples. We introduce a new resampling algorithm based on a stochastic rounding technique of [In 42nd IEEE Symposium on Foundations of Computer Science (Las Vegas, NV, 2001) (2001) 588–597 IEEE Computer Soc.], which shares some attractive properties of systematic resampling, but which exhibits negative association and, therefore, converges irrespective of the order of the input samples. We confirm a conjecture made by [J. Comput. Graph. Statist. 5 (1996) 1–25] that ordering input samples by their states in $\mathbb{R}$ yields a faster rate of convergence; we establish that when particles are ordered using the Hilbert curve in $\mathbb{R}^{d}$, the variance of the resampling error is ${\scriptstyle\mathcal{O}}(N^{-(1+1/d)})$ under mild conditions, where $N$ is the number of particles. We use these results to establish asymptotic properties of particle algorithms based on resampling schemes that differ from multinomial resampling.

48 citations
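
For readers unfamiliar with the schemes being compared, the sketch below implements multinomial, stratified, and systematic resampling of a weighted particle set in NumPy; these are the standard textbook versions, not the paper's stochastic-rounding algorithm.

```python
# Sketch: three classical resampling schemes for a weighted particle set.
import numpy as np

def _inverse_cdf(weights, positions):
    """Map uniform positions through the inverse CDF of the weights."""
    idx = np.searchsorted(np.cumsum(weights), positions)
    return np.minimum(idx, len(weights) - 1)   # guard against floating-point overshoot

def multinomial_resample(weights, rng):
    n = len(weights)
    return rng.choice(n, size=n, replace=True, p=weights)

def stratified_resample(weights, rng):
    n = len(weights)
    positions = (np.arange(n) + rng.random(n)) / n      # one uniform per stratum
    return _inverse_cdf(weights, positions)

def systematic_resample(weights, rng):
    n = len(weights)
    positions = (np.arange(n) + rng.random()) / n       # a single shared uniform
    return _inverse_cdf(weights, positions)

rng = np.random.default_rng(0)
w = rng.random(10)
w /= w.sum()
print("multinomial:", multinomial_resample(w, rng))
print("stratified: ", stratified_resample(w, rng))
print("systematic: ", systematic_resample(w, rng))
```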


Journal ArticleDOI
TL;DR: An optimized particle filter using the maximum variance weight segmentation resampling algorithm is proposed in this paper, which improves the performance of the particle filter and increases the accuracy and stability of motion trajectory tracking tasks.
Abstract: At present, urban computing and intelligence has become an important topic in the research field of artificial intelligence. On the other hand, computer vision, as a crucial bridge between the urban world and artificial intelligence, is playing a key role in urban computing and intelligence. The conventional particle filter is derived from the Kalman filter and is theoretically based on the Monte Carlo method. Sequential importance resampling (SIR) is implemented in the conventional particle filter to avoid the degeneracy problem. In order to overcome the shortcomings of the resampling algorithm in the traditional particle filter, we propose an optimized particle filter using the maximum variance weight segmentation resampling algorithm in this paper, which improves the performance of the particle filter. Compared with the traditional particle filter algorithm, the experimental results show that the proposed scheme outperforms in terms of computational consumption and the accuracy of particle tracking. The final experimental results show that the maximum variance weight segmentation method increases the accuracy and stability of motion trajectory tracking tasks.

48 citations
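
The degeneracy problem and the SIR remedy mentioned above can be summarized in a few lines: monitor the effective sample size of the weights and resample when it falls below a threshold. The sketch below is this generic step only, not the maximum variance weight segmentation algorithm proposed in the paper; all names and the toy observation model are illustrative.

```python
# Sketch: one generic weight-update/resampling step of a SIR particle filter,
# using the standard effective-sample-size criterion (not the paper's method).
import numpy as np

def effective_sample_size(weights):
    return 1.0 / np.sum(weights ** 2)

def resample_if_degenerate(particles, weights, rng, threshold=0.5):
    n = len(weights)
    if effective_sample_size(weights) < threshold * n:        # degeneracy detected
        idx = rng.choice(n, size=n, replace=True, p=weights)  # multinomial resampling
        particles, weights = particles[idx], np.full(n, 1.0 / n)
    return particles, weights

rng = np.random.default_rng(0)
particles = rng.normal(size=500)
likelihood = np.exp(-0.5 * (particles - 1.0) ** 2)            # toy observation model
weights = likelihood / likelihood.sum()
particles, weights = resample_if_degenerate(particles, weights, rng)
```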


Journal ArticleDOI
TL;DR: New MGA tests applicable in the context of PLS-PM are presented and validated, their efficacy is compared to that of existing approaches, and for the first time researchers can statistically compare a whole model across groups by applying a single statistical test.
Abstract: Purpose: People seem to function according to different models, which implies that in business and social sciences, heterogeneity is a rule rather than an exception. Researchers can investigate such heterogeneity through multigroup analysis (MGA). In the context of partial least squares path modeling (PLS-PM), MGA is currently applied to perform multiple comparisons of parameters across groups. However, this approach has significant drawbacks: first, the whole model is not considered when comparing groups, and second, the family-wise error rate is higher than the predefined significance level when the groups are indeed homogenous, leading to incorrect conclusions. Against this background, the purpose of this paper is to present and validate new MGA tests, which are applicable in the context of PLS-PM, and to compare their efficacy to existing approaches. Design/methodology/approach: The authors propose two tests that adopt the squared Euclidean distance and the geodesic distance to compare the model-implied indicator correlation matrix across groups. The authors employ permutation to obtain the corresponding reference distribution to draw statistical inference about group differences. A Monte Carlo simulation provides insights into the sensitivity and specificity of both permutation tests and their performance, in comparison to existing approaches. Findings: Both proposed tests provide a considerable degree of statistical power. However, the test based on the geodesic distance outperforms the test based on the squared Euclidean distance in this regard. Moreover, both proposed tests lead to rejection rates close to the predefined significance level in the case of no group differences. Hence, our proposed tests are more reliable than an uncontrolled repeated comparison approach. Research limitations/implications: Current guidelines on MGA in the context of PLS-PM should be extended by applying the proposed tests in an early phase of the analysis. Beyond our initial insights, more research is required to assess the performance of the proposed tests in different situations. Originality/value: This paper contributes to the existing PLS-PM literature by proposing two new tests to assess multigroup differences. For the first time, this allows researchers to statistically compare a whole model across groups by applying a single statistical test.
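
A stripped-down sketch of the permutation logic described above: compute a distance (here the squared Euclidean distance) between the indicator correlation matrices of two groups, then build the reference distribution by permuting group membership. A real PLS-PM application would compare model-implied rather than empirical correlation matrices; all data and names below are illustrative.

```python
# Sketch: permutation test for a group difference in indicator correlation matrices.
# Uses empirical correlation matrices as a stand-in for the model-implied ones.
import numpy as np

def squared_euclidean_distance(group_a, group_b):
    diff = np.corrcoef(group_a, rowvar=False) - np.corrcoef(group_b, rowvar=False)
    return float((diff ** 2).sum())

def permutation_test(group_a, group_b, n_perm=999, seed=0):
    rng = np.random.default_rng(seed)
    pooled = np.vstack([group_a, group_b])
    n_a = len(group_a)
    observed = squared_euclidean_distance(group_a, group_b)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        d = squared_euclidean_distance(pooled[perm[:n_a]], pooled[perm[n_a:]])
        count += d >= observed
    return (1 + count) / (1 + n_perm)

rng = np.random.default_rng(1)
group_a = rng.multivariate_normal([0, 0, 0], [[1, .5, .2], [.5, 1, .3], [.2, .3, 1]], 150)
group_b = rng.multivariate_normal([0, 0, 0], [[1, .1, .2], [.1, 1, .3], [.2, .3, 1]], 150)
print("permutation p-value:", permutation_test(group_a, group_b))
```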

Journal ArticleDOI
TL;DR: SRE combines the operators of resampling and periodical update to handle the joint issue of concept drift and class imbalance, and empirical studies demonstrate the effectiveness of SRE in learning nonstationary imbalanced data streams.
Abstract: Although the issues of concept drift and class imbalance have been studied separately, the joint problem is underexplored even though it has received increasing attention. Concept drift is further complicated when the dataset is class imbalanced. Meanwhile, most of the existing techniques have ignored the influence of complex data distribution on learning imbalanced data streams. To overcome these issues, we propose an ensemble-based model for learning concept drift from imbalanced data streams with complex data distribution, called selection-based resampling ensemble (SRE). SRE combines the operators of resampling and periodical update to handle the joint issue. In the chunk-based framework, a selection-based resampling mechanism, which focuses on drifting and unsafe examples, is first employed to re-balance the class distribution of the latest block. Then, previous ensemble members are periodically updated using the latest examples, where update weights are determined to emphasize costly misclassification examples and minority examples. Meanwhile, SRE can quickly react to new conditions. Empirical studies demonstrate the effectiveness of SRE in learning nonstationary imbalanced data streams.

Journal ArticleDOI
TL;DR: This paper analyzes the impact of imbalanced data distribution and positive and negative sample overlap on the machine learning classification model and presents a personality prediction method based on particle swarm optimization (PSO) and synthetic minority oversampling technique + Tomek Link (SMOTETomek) resampling (PSO-SMOTETomek), which significantly outperforms existing state-of-the-art models.
Abstract: The main challenge of user personality recognition is low accuracy resulting from small sample size and severe sample distribution imbalance. This paper analyzes the impact of imbalanced data distribution and positive and negative sample overlap on the machine learning classification model. The classification model is based on the data resampling technique, which can improve the classification accuracy. These problems can be solved once the data are effectively resampled. We present a personality prediction method based on particle swarm optimization (PSO) and synthetic minority oversampling technique + Tomek Link (SMOTETomek) resampling (PSO-SMOTETomek), which, apart from effective SMOTETomek resampling of data samples, is able to execute PSO feature optimization for each set of feature combinations. Validated by simulation, our analysis reveals that the PSO-SMOTETomek method is efficient under a small dataset, and the accuracy of personality recognition is improved by up to around 10%. The results are better than those of previous similar studies. The average accuracies of the plain text dataset and the non-plain text dataset are 75.34% and 78.78%, respectively. The average accuracies of the short text dataset and the long text dataset are 75.34% and 64.25%, respectively. From the experimental results, we found that short text has a better classification effect than long text. Plain text data can still have high personality discrimination accuracy, but there is no relevant external information. The proposed model is able to facilitate the design and implementation of a personality recognition system, and the model significantly outperforms existing state-of-the-art models.
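
The SMOTETomek resampling component is available in imbalanced-learn; the brief sketch below shows only that rebalancing step, omitting the PSO feature-optimization stage, and uses synthetic data rather than the personality data set of the paper.

```python
# Sketch: combined over- and under-sampling with SMOTETomek (imbalanced-learn).
# Only the resampling stage is shown; the PSO feature optimization is omitted.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek

X, y = make_classification(n_samples=1000, n_features=10, weights=[0.85, 0.15],
                           random_state=0)
print("before:", Counter(y))

X_res, y_res = SMOTETomek(random_state=0).fit_resample(X, y)
print("after SMOTETomek:", Counter(y_res))
```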

Journal ArticleDOI
TL;DR: It is demonstrated that application of site-wise bootstrapping generally resulted in gene trees with substantial additional conflicts relative to the original data; this approach therefore cannot be relied upon to provide conservative support, and it is suggested that gene-wise resampling support should be favored over gene + site or site-wise resampling when numerous genes are sampled.

Journal ArticleDOI
TL;DR: A parametric model is utilized to expose the traces of resampling forgery, which is described with the distribution of residual noise, and a statistical model describing the residual noise from a resampled image is proposed.
Abstract: The problem of authenticating a re-sampled image has been investigated over many years. Currently, however, little research proposes a statistical model-based test, with the result that the statistical performance of resampling detectors cannot be completely analyzed. To fill the gap, we utilize a parametric model to expose the traces of resampling forgery, which is described with the distribution of residual noise. Afterward, we propose a statistical model describing the residual noise from a resampled image. Then, the detection problem is cast into the framework of hypothesis testing theory. By considering the image content through the design of a texture weight map, two types of statistical detectors are established. In an ideal context in which all distribution parameters are perfectly known, the likelihood ratio test (LRT) is presented and its performance is theoretically established. An upper bound of the detection power can be successfully obtained from the statistical performance of an LRT. For practical use, when the distribution parameters are not known, a generalized LRT with three different maps based on estimation of parameters is established. Numerical results on simulated data and real natural images highlight the relevance of our proposed approach.

Journal ArticleDOI
TL;DR: This work proposes a novel, regionless, enhanced sampling method that is based on the weighted ensemble framework, and expects that resampling of ensembles by variation optimization will be a useful general tool to broadly explore free energy landscapes.
Abstract: Conventional molecular dynamics simulations are incapable of sampling many important interactions in biomolecular systems due to their high dimensionality and rough energy landscapes. To observe rare events and calculate transition rates in these systems, enhanced sampling is a necessity. In particular, the study of ligand-protein interactions necessitates a diverse ensemble of protein conformations and transition states, and for many systems, this occurs on prohibitively long time scales. Previous strategies such as WExplore that can be used to determine these types of ensembles are hindered by problems related to the regioning of conformational space. Here, we propose a novel, regionless, enhanced sampling method that is based on the weighted ensemble framework. In this method, a value referred to as “trajectory variation” is optimized after each cycle through cloning and merging operations. This method allows for a more consistent measurement of observables and broader sampling resulting in the efficient exploration of previously unexplored conformations. We demonstrate the performance of this algorithm with the N-dimensional random walk and the unbinding of the trypsin-benzamidine system. The system is analyzed using conformation space networks, the residence time of benzamidine is confirmed, and a new unbinding pathway for the trypsin-benzamidine system is found. We expect that resampling of ensembles by variation optimization will be a useful general tool to broadly explore free energy landscapes.

Journal ArticleDOI
TL;DR: In this paper, the authors provide an equivalent formulation of bootstrap consistency in the space of bounded functions, which is more intuitive and easy to work with than the weak convergence of conditional laws in the Hoffmann-Jorgensen sense.
Abstract: The consistency of a bootstrap or resampling scheme is classically validated by weak convergence of conditional laws. However, when working with stochastic processes in the space of bounded functions and their weak convergence in the Hoffmann–Jorgensen sense, an obstacle occurs: due to possible non-measurability, neither laws nor conditional laws are well defined. Starting from an equivalent formulation of weak convergence based on the bounded Lipschitz metric, a classical circumvention is to formulate bootstrap consistency in terms of the latter distance between what might be called a conditional law of the (non-measurable) bootstrap process and the law of the limiting process. The main contribution of this note is to provide an equivalent formulation of bootstrap consistency in the space of bounded functions which is more intuitive and easy to work with. Essentially, the equivalent formulation consists of (unconditional) weak convergence of the original process jointly with two bootstrap replicates. As a by-product, we provide two equivalent formulations of bootstrap consistency for statistics taking values in separable metric spaces: the first in terms of (unconditional) weak convergence of the statistic jointly with its bootstrap replicates, the second in terms of convergence in probability of the empirical distribution function of the bootstrap replicates. Finally, the asymptotic validity of bootstrap-based confidence intervals and tests is briefly revisited, with particular emphasis on the (in practice, unavoidable) Monte Carlo approximation of conditional quantiles.
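
Schematically, and with notation that only loosely follows the paper, the two formulations can be written as below: the classical conditional statement uses the bounded Lipschitz metric, while the equivalent statement is an unconditional joint weak convergence with two bootstrap replicates.

```latex
% Schematic statement (simplified notation, not verbatim from the paper).
% Classical conditional formulation, for a bootstrap process \hat{G}_n and limit G:
\sup_{h \in \mathrm{BL}_1(\ell^\infty(T))}
  \Bigl| \mathbb{E}\bigl[ h(\hat{G}_n) \mid X_1,\dots,X_n \bigr]
       - \mathbb{E}\bigl[ h(G) \bigr] \Bigr|
  \;\xrightarrow{\;\mathbb{P}^*\;}\; 0 .
% Equivalent unconditional formulation, jointly with two bootstrap replicates:
\bigl( G_n, \hat{G}_n^{(1)}, \hat{G}_n^{(2)} \bigr)
  \;\rightsquigarrow\;
  \bigl( G, G^{(1)}, G^{(2)} \bigr),
\qquad G^{(1)}, G^{(2)} \ \text{independent copies of } G .
```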

Proceedings ArticleDOI
18 Jul 2019
TL;DR: Having such a large supply of IR evaluation data with full knowledge of the null hypotheses, this study is finally in a position to evaluate how well statistical significance tests really behave with IR data, and make sound recommendations for practitioners.
Abstract: Statistical significance testing is widely accepted as a means to assess how well a difference in effectiveness reflects an actual difference between systems, as opposed to random noise because of the selection of topics. According to recent surveys on SIGIR, CIKM, ECIR and TOIS papers, the t-test is the most popular choice among IR researchers. However, previous work has suggested computer intensive tests like the bootstrap or the permutation test, based mainly on theoretical arguments. On empirical grounds, others have suggested non-parametric alternatives such as the Wilcoxon test. Indeed, the question of which tests we should use has accompanied IR and related fields for decades now. Previous theoretical studies on this matter were limited in that we know that test assumptions are not met in IR experiments, and empirical studies were limited in that we do not have the necessary control over the null hypotheses to compute actual Type I and Type II error rates under realistic conditions. Therefore, not only is it unclear which test to use, but also how much trust we should put in them. In contrast to past studies, in this paper we employ a recent simulation methodology from TREC data to go around these limitations. Our study comprises over 500 million p-values computed for a range of tests, systems, effectiveness measures, topic set sizes and effect sizes, and for both the 2-tail and 1-tail cases. Having such a large supply of IR evaluation data with full knowledge of the null hypotheses, we are finally in a position to evaluate how well statistical significance tests really behave with IR data, and make sound recommendations for practitioners.
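
As a concrete illustration of the resampling tests discussed, here is a small sketch of a paired randomization (sign-flipping permutation) test on per-topic score differences between two systems, compared against the paired t-test; the per-topic scores are simulated and purely illustrative.

```python
# Sketch: paired t-test vs. a sign-flipping randomization test on per-topic
# effectiveness differences between two retrieval systems (simulated scores).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_topics = 50
system_a = rng.beta(5, 10, n_topics)              # simulated per-topic AP scores
system_b = np.clip(system_a + rng.normal(0.02, 0.05, n_topics), 0, 1)
diffs = system_b - system_a

t_stat, t_pvalue = stats.ttest_rel(system_b, system_a)

n_perm = 10000
observed = diffs.mean()
signs = rng.choice([-1.0, 1.0], size=(n_perm, n_topics))   # random sign flips
perm_means = (signs * diffs).mean(axis=1)
perm_pvalue = (1 + np.sum(np.abs(perm_means) >= abs(observed))) / (1 + n_perm)

print(f"t-test p = {t_pvalue:.4f}, randomization p = {perm_pvalue:.4f}")
```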

Journal ArticleDOI
TL;DR: In this paper, a new estimator of the restricted mean survival time in randomized trials where there is right censoring that may depend on treatment and baseline variables is presented, which leverages prognostic baseline variables to obtain equal or better asymptotic precision compared to traditional estimators.
Abstract: We present a new estimator of the restricted mean survival time in randomized trials where there is right censoring that may depend on treatment and baseline variables. The proposed estimator leverages prognostic baseline variables to obtain equal or better asymptotic precision compared to traditional estimators. Under regularity conditions and random censoring within strata of treatment and baseline variables, the proposed estimator has the following features: (i) it is interpretable under violations of the proportional hazards assumption; (ii) it is consistent and at least as precise as the Kaplan-Meier and inverse probability weighted estimators, under identifiability conditions; (iii) it remains consistent under violations of independent censoring (unlike the Kaplan-Meier estimator) when either the censoring or survival distributions, conditional on covariates, are estimated consistently; and (iv) it achieves the nonparametric efficiency bound when both of these distributions are consistently estimated. We illustrate the performance of our method using simulations based on resampling data from a completed, phase 3 randomized clinical trial of a new surgical treatment for stroke; the proposed estimator achieves a 12% gain in relative efficiency compared to the Kaplan-Meier estimator. The proposed estimator has potential advantages over existing approaches for randomized trials with time-to-event outcomes, since existing methods either rely on model assumptions that are untenable in many applications, or lack some of the efficiency and consistency properties (i)-(iv). We focus on estimation of the restricted mean survival time, but our methods may be adapted to estimate any treatment effect measure defined as a smooth contrast between the survival curves for each study arm. We provide R code to implement the estimator.

Journal ArticleDOI
01 May 2019-Geoderma
TL;DR: In this article, the DSMART algorithm was used to disaggregate conventional soil maps and to produce high-quality soil maps when point observations are not available, and the results demonstrated that a suitable approach can provide reliable soil maps at a national extent.

Proceedings ArticleDOI
13 Jul 2019
TL;DR: This work shows that when the noisy fitness is ϵ-concentrated, then a logarithmic number of samples suffice to discover the undisturbed fitness (via the median of the samples) with high probability, and gives a simple metaheuristic approach to transform a randomized optimization heuristic into one that is robust to this type of noise.
Abstract: Due to their randomized nature, many nature-inspired heuristics are robust to some level of noise in the fitness evaluations. A common strategy to increase the tolerance to noise is to re-evaluate the fitness of a solution candidate several times and to then work with the average of the sampled fitness values. In this work, we propose to use the median instead of the mean. Besides being invariant to rescalings of the fitness, the median in many situations turns out to be much more robust than the mean. We show that when the noisy fitness is ϵ-concentrated, then a logarithmic number of samples suffice to discover the undisturbed fitness (via the median of the samples) with high probability. This gives a simple metaheuristic approach to transform a randomized optimization heuristic into one that is robust to this type of noise and that has a runtime higher than the original one only by a logarithmic factor. We show further that ϵ-concentrated noise occurs frequently in standard situations. We also provide lower bounds showing that in two such situations, even with larger numbers of samples, the average-resample strategy cannot efficiently optimize the problem in polynomial time.
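
A minimal sketch of the median-of-samples strategy advocated above: wrap a noisy fitness function so that each evaluation returns the median of a logarithmic number of re-evaluations. The fitness function, noise model, and sample-size constant are illustrative assumptions.

```python
# Sketch: median-of-k resampling wrapper for a noisy fitness function.
import math
import numpy as np

rng = np.random.default_rng(0)

def noisy_onemax(bitstring):
    """OneMax fitness disturbed by additive noise (illustrative noise model)."""
    return bitstring.sum() + rng.normal(0, 1)

def median_fitness(bitstring, noisy_f, n_bits):
    """Median of a logarithmic number of re-evaluations, as suggested for concentrated noise."""
    k = max(3, 2 * int(math.ceil(math.log2(n_bits))) + 1)   # odd, logarithmic sample size
    return float(np.median([noisy_f(bitstring) for _ in range(k)]))

n_bits = 64
x = rng.integers(0, 2, size=n_bits)
print("single noisy evaluation:", noisy_onemax(x))
print("median of repeated evaluations:", median_fitness(x, noisy_onemax, n_bits))
print("true fitness:", int(x.sum()))
```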

Journal ArticleDOI
TL;DR: In this article, the authors study the problem of selecting the optimal sample size and the number of folds in a k-fold cross-validation model for logistic regression and support vector machines.

Journal ArticleDOI
TL;DR: A new resampling algorithm, called Similarity Oversampling and Undersampling Preprocessing (SOUP), is introduced; it resamples examples according to their difficulty and is competitive with the most popular decomposition ensembles and better than specialized preprocessing techniques for multi-imbalanced problems.
Abstract: The relations between multiple imbalanced classes can be handled with a specialized approach which evaluates types of examples’ difficulty based on an analysis of the class distribution in the examples’ neighborhood, additionally exploiting information about the similarity of neighboring classes. In this paper, we demonstrate that such an approach can be implemented as a data preprocessing technique and that it can improve the performance of various classifiers on multiclass imbalanced datasets. It has led us to the introduction of a new resampling algorithm, called Similarity Oversampling and Undersampling Preprocessing (SOUP), which resamples examples according to their difficulty. Its experimental evaluation on real and artificial datasets has shown that it is competitive with the most popular decomposition ensembles and better than specialized preprocessing techniques for multi-imbalanced problems.

Journal ArticleDOI
TL;DR: In this paper, a local two-sample testing framework is proposed to identify local differences between multivariate distributions with statistical confidence, which can efficiently handle different types of variables and various structures in the data, with competitive power under many practical scenarios.
Abstract: Two-sample testing is a fundamental problem in statistics. Despite its long history, there has been renewed interest in this problem with the advent of high-dimensional and complex data. Specifically, in the machine learning literature, there have been recent methodological developments such as classification accuracy tests. The goal of this work is to present a regression approach to comparing multivariate distributions of complex data. Depending on the chosen regression model, our framework can efficiently handle different types of variables and various structures in the data, with competitive power under many practical scenarios. Whereas previous work has been largely limited to global tests which conceal much of the local information, our approach naturally leads to a local two-sample testing framework in which we identify local differences between multivariate distributions with statistical confidence. We demonstrate the efficacy of our approach both theoretically and empirically, under some well-known parametric and nonparametric regression methods. Our proposed methods are applied to simulated data as well as a challenging astronomy data set to assess their practical usefulness.
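
In the same spirit as the classification accuracy tests mentioned above (though not the authors' regression framework), the sketch below fits a logistic regression to distinguish the two samples and calibrates its cross-validated accuracy by permuting the sample labels; all data and parameters are illustrative.

```python
# Sketch: a classifier-based two-sample test with permutation calibration.
# Related in spirit to, but simpler than, the regression framework of the paper.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def classifier_accuracy(x, labels):
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, x, labels, cv=5, scoring="accuracy").mean()

rng = np.random.default_rng(0)
sample1 = rng.normal(0.0, 1.0, size=(200, 5))
sample2 = rng.normal(0.3, 1.0, size=(200, 5))          # shifted second sample
x = np.vstack([sample1, sample2])
labels = np.repeat([0, 1], 200)

observed = classifier_accuracy(x, labels)
perm_acc = [classifier_accuracy(x, rng.permutation(labels)) for _ in range(199)]
p_value = (1 + sum(a >= observed for a in perm_acc)) / (1 + len(perm_acc))
print(f"cross-validated accuracy = {observed:.3f}, permutation p-value = {p_value:.3f}")
```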


Journal ArticleDOI
TL;DR: This paper studies whether the missForest imputation approach, which has shown high imputation accuracy under the Missing (Completely) at Random scheme, can be enhanced by other methods such as the stochastic gradient tree boosting method, the C5.0 algorithm, BART or modified random forest procedures.
Abstract: Missing data is an expected issue when large amounts of data are collected, and several imputation techniques have been proposed to tackle this problem. Besides classical approaches such as MICE, the application of Machine Learning techniques is tempting. Here, the recently proposed missForest imputation method has shown high imputation accuracy under the Missing (Completely) at Random scheme with various missing rates. At its core, it is based on a random forest for classification and regression, respectively. In this paper we study whether this approach can even be enhanced by other methods such as the stochastic gradient tree boosting method, the C5.0 algorithm, BART or modified random forest procedures. In particular, other resampling strategies within the random forest protocol are suggested. In an extensive simulation study, we analyze their performances for continuous, categorical as well as mixed-type data. An empirical analysis focusing on credit information and Facebook data complements our investigations.
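
A rough scikit-learn analogue of the random-forest-based imputation at the heart of missForest is IterativeImputer with a random forest as its estimator; the sketch below shows this stand-in on synthetic MCAR data and is not the R implementations compared in the paper.

```python
# Sketch: missForest-style imputation via scikit-learn's IterativeImputer with a
# random forest estimator, applied to data with values missing completely at random.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_complete = rng.multivariate_normal([0, 0, 0],
                                     [[1, .6, .3], [.6, 1, .4], [.3, .4, 1]], 500)
X_missing = X_complete.copy()
X_missing[rng.random(X_missing.shape) < 0.2] = np.nan       # 20% MCAR missingness

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X_missing)

mask = np.isnan(X_missing)
rmse = np.sqrt(np.mean((X_imputed[mask] - X_complete[mask]) ** 2))
print(f"imputation RMSE on the masked entries: {rmse:.3f}")
```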

Journal ArticleDOI
TL;DR: A Gaussian mixture model-based combined resampling algorithm is proposed that consistently improves classification performances such as F-measure, AUC, and G-mean, and has strong robustness for credit data sets.
Abstract: Credit scoring represents a two-class classification problem. Moreover, the data imbalance of credit data sets, where one class contains a small number of data samples and the other contains a large number of data samples, is a frequent problem. Therefore, if only a traditional classifier is used to classify the data, the final classification effect will be affected. To improve the classification of credit data sets, a Gaussian mixture model based combined resampling algorithm is proposed. This resampling approach first determines the number of samples of the majority class and the minority class using a sampling factor. Then, Gaussian mixture clustering is used for undersampling of the majority class samples, and the synthetic minority oversampling technique is used for the rest of the samples, so that any remaining imbalance problem is eliminated. Here we compare several resampling methods commonly used in the analysis of imbalanced credit data sets. The obtained experimental results demonstrate that the proposed method consistently improves classification performances such as F-measure, AUC, and G-mean, and has strong robustness for credit data sets.

Journal ArticleDOI
TL;DR: This work explores four strategies for updating Monte Carlo simulations for a change in distribution and shows that, when the change in measure is small, importance sampling reweighting can be very effective and a proposed mixed augmenting-filtering algorithm can robustly and efficiently accommodate a measure change in Monte Carlo simulation.
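
Importance sampling reweighting for a small change of measure can be sketched in a few lines: reuse samples drawn under the original distribution and weight each by the density ratio of the new to the old distribution. The Gaussian distributions and quantity of interest below are illustrative, not the paper's application.

```python
# Sketch: reusing Monte Carlo samples under a changed distribution via
# importance sampling reweighting (illustrative Gaussian change of measure).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
old_dist = stats.norm(loc=0.0, scale=1.0)      # distribution the samples came from
new_dist = stats.norm(loc=0.3, scale=1.1)      # slightly changed distribution

samples = old_dist.rvs(size=100_000, random_state=rng)
g = samples ** 2                                # quantity of interest (illustrative)

weights = new_dist.pdf(samples) / old_dist.pdf(samples)    # likelihood ratios
estimate = np.sum(weights * g) / np.sum(weights)           # self-normalized IS estimate
ess = weights.sum() ** 2 / np.sum(weights ** 2)            # effective sample size

print(f"reweighted E[X^2] under new distribution: {estimate:.3f}")
print(f"exact value: {new_dist.var() + new_dist.mean()**2:.3f}, ESS: {ess:.0f}")
```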

Journal ArticleDOI
TL;DR: An exact, unconditional, non-randomized procedure for producing confidence intervals for the grand mean in a normal-normal random effects meta-analysis is described and illustrated on meta-analyses investigating the effect of calcium intake on bone mineral density.
Abstract: We describe an exact, unconditional, non-randomized procedure for producing confidence intervals for the grand mean in a normal-normal random effects meta-analysis. The procedure targets meta-analyses based on too few primary studies, ≤ 7, say, to allow for the conventional asymptotic estimators, e.g., DerSimonian and Laird (1986), or non-parametric resampling-based procedures, e.g., Liu et al. (2017). Meta-analyses with such few studies are common, with one recent sample of 22,453 health-related meta-analyses finding a median of 3 primary studies per meta-analysis (Davey et al., 2011). Reliable and efficient inference procedures are therefore needed to address this setting. The coverage level of the resulting CI is guaranteed to be above the nominal level, up to Monte Carlo error, provided the meta-analysis contains more than 1 study and the model assumptions are met. After employing several techniques to accelerate computation, the new CI can be easily constructed on a personal computer. Simulations suggest that the proposed CI typically is not overly conservative. We illustrate the approach on several contrasting examples of meta-analyses investigating the effect of calcium intake on bone mineral density.

Book ChapterDOI
13 Oct 2019
TL;DR: The novel transposition test is proposed that exploits the underlying algebraic structure of the permutation group and is applied to a large number of diffusion tensor images in localizing the regions of the brain network differences.
Abstract: The permutation test is an often used test procedure for determining statistical significance in brain network studies. Unfortunately, generating every possible permutation for large-scale brain imaging datasets such as HCP and ADNI with hundreds of subjects is not practical. Many previous attempts at speeding up the permutation test rely on various approximation strategies such as estimating the tail distribution with known parametric distributions. In this study, we propose the novel transposition test that exploits the underlying algebraic structure of the permutation group. The method is applied to a large number of diffusion tensor images in localizing the regions of the brain network differences.