
Showing papers on "Resampling published in 2021"


Journal ArticleDOI
TL;DR: In this article, the performance of 23 class imbalance methods (resampling and hybrid systems) with three classical classifiers (logistic regression, random forest, and LinearSVC) was used to identify the best imbalance techniques suitable for medical datasets.
Abstract: Medical datasets are usually imbalanced, where negative cases severely outnumber positive cases. Therefore, it is essential to deal with this data skew problem when training machine learning algorithms. This study uses two representative lung cancer datasets, PLCO and NLST, with imbalance ratios (the proportion of samples in the majority class to those in the minority class) of 24.7 and 25.0, respectively, to predict lung cancer incidence. This research uses the performance of 23 class imbalance methods (resampling and hybrid systems) with three classical classifiers (logistic regression, random forest, and LinearSVC) to identify the best imbalance techniques suitable for medical datasets. Resampling includes ten under-sampling methods (RUS, etc.), seven over-sampling methods (SMOTE, etc.), and two integrated sampling methods (SMOTEENN, SMOTE-Tomek). Hybrid systems include Balanced Bagging, etc. The results show that class imbalance learning can improve the classification ability of the model. Compared with other imbalance techniques, under-sampling techniques have the highest standard deviation (SD), and over-sampling techniques have the lowest SD. Over-sampling is a stable method, and its AUC is generally higher than that of the other approaches. Using ROS, the random forest achieves the best predictive ability and is the most suitable for the lung cancer datasets used in this study. The code is available at https://mkhushi.github.io/
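As a rough illustration of the kind of pipeline benchmarked in this study, the sketch below pairs random over-sampling (ROS) with a random forest and scores AUC on a held-out split. It assumes the scikit-learn and imbalanced-learn APIs and uses synthetic data; the class weights and model settings are illustrative, not the study's configuration.

```python
# Sketch only: ROS + random forest with AUC evaluation on a synthetic
# imbalanced dataset (the paper's PLCO/NLST data are not reproduced here).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler

# Roughly 25:1 imbalance, mimicking the reported imbalance ratios.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.96, 0.04], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0)

# Resample only the training portion, never the test portion.
X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X_train, y_train)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_res, y_res)
print("AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```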

62 citations


Journal ArticleDOI
TL;DR: In this paper, resampling is used to adjust the ratio between the different classes, making the data more balanced and improving the performance of artificial neural network multi-class classifiers.
Abstract: Machine learning plays an increasingly significant role in the building of Network Intrusion Detection Systems. However, machine learning models trained with imbalanced cybersecurity data cannot recognize minority data, hence attacks, effectively. One way to address this issue is to use resampling, which adjusts the ratio between the different classes, making the data more balanced. This research looks at resampling’s influence on the performance of Artificial Neural Network multi-class classifiers. The resampling methods random undersampling, random oversampling, random undersampling combined with random oversampling, random undersampling with the Synthetic Minority Oversampling Technique, and random undersampling with the Adaptive Synthetic Sampling Method were used on the benchmark cybersecurity datasets KDD99, UNSW-NB15, UNSW-NB17 and UNSW-NB18. Macro precision, macro recall, and macro F1-score were used to evaluate the results. The patterns found were: first, oversampling increases the training time and undersampling decreases it; second, if the data is extremely imbalanced, both oversampling and undersampling increase recall significantly; third, if the data is not extremely imbalanced, resampling will not have much of an impact; fourth, with resampling, mostly oversampling, more of the minority data (attacks) were detected.
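A minimal sketch of one of the combinations studied (random undersampling followed by SMOTE, feeding a small neural network, scored with macro-averaged metrics), assuming the imbalanced-learn pipeline API and synthetic multi-class data rather than the KDD99/UNSW datasets; the sampling targets and network size are illustrative assumptions.

```python
# Sketch only: undersample the majority class, then SMOTE the minorities,
# then train an ANN classifier and report macro precision/recall/F1.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import precision_score, recall_score, f1_score
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=6000, n_features=30, n_informative=10,
                           n_classes=4, weights=[0.85, 0.09, 0.04, 0.02],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

pipe = Pipeline([
    ("under", RandomUnderSampler(sampling_strategy={0: 2000}, random_state=0)),
    ("smote", SMOTE(random_state=0)),   # bring minority classes up to 2000
    ("ann", MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                          random_state=0)),
])
pipe.fit(X_tr, y_tr)
y_pred = pipe.predict(X_te)
for name, fn in [("macro precision", precision_score),
                 ("macro recall", recall_score),
                 ("macro F1", f1_score)]:
    print(name, fn(y_te, y_pred, average="macro"))
```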

62 citations


Journal ArticleDOI
TL;DR: It is shown that datastream permutations typically do not represent the null hypothesis of interest to researchers interfacing animal social network analysis with regression modelling, and simulations are used to demonstrate the potential pitfalls of using this methodology.
Abstract: Social network methods have become a key tool for describing, modelling, and testing hypotheses about the social structures of animals. However, due to the non-independence of network data and the presence of confounds, specialized statistical techniques are often needed to test hypotheses in these networks. Datastream permutations, originally developed to test the null hypothesis of random social structure, have become a popular tool for testing a wide array of null hypotheses. In particular, they have been used to test whether exogenous factors are related to network structure by interfacing these permutations with regression models. Here, we show that these datastream permutations typically do not represent the null hypothesis of interest to researchers interfacing animal social network analysis with regression modelling, and use simulations to demonstrate the potential pitfalls of using this methodology. Our simulations show that utilizing common datastream permutations to test the coefficients of regression models can lead to extremely high type I (false-positive) error rates (> 30%) in the presence of non-random social structure. The magnitude of this problem is primarily dependent on the degree of non-randomness within the social structure and the intensity of sampling. We strongly recommend against utilizing datastream permutations to test regression models in animal social networks. We suggest that a potential solution may be found in regarding the problems of non-independence of network data and unreliability of observations as separate problems with distinct solutions.
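For context, the sketch below shows a generic permutation test of a regression slope relating a node-level network metric to an exogenous trait. It is illustrative only: it is neither the datastream procedure critiqued in this paper nor the authors' recommended fix, and the simulated metric and trait are assumptions.

```python
# Sketch only: a node-label permutation test for a regression coefficient
# relating an individual network metric (e.g. strength) to an exogenous trait.
import numpy as np

rng = np.random.default_rng(0)
n = 60
trait = rng.normal(size=n)                      # exogenous covariate per individual
strength = 0.5 * trait + rng.normal(size=n)     # simulated network metric

def slope(x, y):
    """Ordinary least-squares slope of y on x."""
    xc = x - x.mean()
    return (xc @ (y - y.mean())) / (xc @ xc)

observed = slope(trait, strength)
perms = np.array([slope(trait, rng.permutation(strength))
                  for _ in range(5000)])
p_value = (np.sum(np.abs(perms) >= abs(observed)) + 1) / (len(perms) + 1)
print(f"observed slope = {observed:.3f}, permutation p = {p_value:.4f}")
```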

45 citations


Journal ArticleDOI
TL;DR: The results show that for the nowcast problem, resampling and data augmentation can effectively enhance the model performance, reducing overall root mean squared error (RMSE) by an average of 4%, or a 15 std.

38 citations


Journal ArticleDOI
TL;DR: In this article, a CT texture-based model was developed to predict epidermal growth factor receptor (EGFR)-mutated, anaplastic lymphoma kinase (ALK)-rearranged lung adenocarcinomas and distinguish them from wild-type tumors on pre-treatment CT scans.
Abstract: To develop a CT texture-based model able to predict epidermal growth factor receptor (EGFR)-mutated, anaplastic lymphoma kinase (ALK)-rearranged lung adenocarcinomas and distinguish them from wild-type tumors on pre-treatment CT scans. Texture analysis was performed using proprietary software TexRAD (TexRAD Ltd, Cambridge, UK) on pre-treatment contrast-enhanced CT scans of 84 patients with metastatic primary lung adenocarcinoma. Textural features were quantified using the filtration-histogram approach with different spatial scale filters on a single 5-mm-thick central slice considered representative of the whole tumor. In order to deal with class imbalance regarding mutational status percentages in our population, the dataset was optimized using the synthetic minority over-sampling technique (SMOTE) and correlations with textural features were investigated using a generalized boosted regression model (GBM) with a nested cross-validation approach (performance averaged over 1000 resampling episodes). ALK rearrangements, EGFR mutations and wild-type tumors were observed in 19, 28 and 37 patients, respectively, in the original dataset. The balanced dataset was composed of 171 observations. Among the 29 original texture variables, 17 were employed for model building. Skewness on unfiltered images and on fine texture were the most important features. EGFR-mutated tumors showed the highest skewness while ALK-rearranged tumors had the lowest values with wild-type tumors showing intermediate values. The average accuracy of the model calculated on the independent nested validation set was 81.76% (95% CI 81.45–82.06). Texture analysis, in particular skewness values, could be promising for noninvasive characterization of lung adenocarcinoma with respect to EGFR and ALK mutations.
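A hedged sketch of the leakage-safe way to combine SMOTE with cross-validation (resampling applied inside each training fold only), with a scikit-learn gradient-boosting classifier standing in for the paper's GBM and synthetic features in place of the CT texture variables.

```python
# Sketch only: SMOTE applied inside each cross-validation training fold,
# never to the held-out fold, via an imbalanced-learn pipeline.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=300, n_features=29, n_informative=10,
                           n_classes=3, weights=[0.45, 0.33, 0.22],
                           random_state=0)

pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),               # balance the training fold only
    ("gbm", GradientBoostingClassifier(random_state=0)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")
print("mean CV accuracy:", scores.mean())
```

Placing SMOTE inside the pipeline matters because oversampling before the split would leak synthetic copies of validation cases into training and inflate the estimated accuracy.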

37 citations


Journal Article
TL;DR: This work proposes a simple resampling scheme that adapts existing robust algorithms to heterogeneous datasets at a negligible computational cost, and theoretically and experimentally validates the approach, showing that combining resampling with existing robust algorithms is effective against challenging attacks.
Abstract: In Byzantine-robust distributed optimization, a central server wants to train a machine learning model over data distributed across multiple workers. However, a fraction of these workers may deviate from the prescribed algorithm and send arbitrary messages to the server. While this problem has received significant attention recently, most current defenses assume that the workers have identical data distribution. For realistic cases when the data across workers are heterogeneous (non-iid), we design new attacks that circumvent these defenses leading to significant loss of performance. We then propose a universal resampling scheme that addresses data heterogeneity at a negligible computational cost. We theoretically and experimentally validate our approach, showing that combining resampling with existing robust algorithms is effective against challenging attacks.
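A minimal sketch of the general resampling-before-aggregation idea on simulated worker gradients, assuming a bucket-averaging step followed by a coordinate-wise median; the bucket size and the choice of robust rule are illustrative, not the paper's exact scheme.

```python
# Sketch only: resample/bucket worker gradients before a robust aggregator
# (here coordinate-wise median) to reduce the heterogeneity the aggregator sees.
import numpy as np

def bucketed_robust_aggregate(grads, bucket_size=2, rng=None):
    """grads: (n_workers, dim) array of per-worker gradients."""
    rng = np.random.default_rng() if rng is None else rng
    n = grads.shape[0]
    order = rng.permutation(n)                       # random assignment to buckets
    buckets = [grads[order[i:i + bucket_size]].mean(axis=0)
               for i in range(0, n, bucket_size)]
    return np.median(np.stack(buckets), axis=0)      # robust aggregation

rng = np.random.default_rng(0)
honest = rng.normal(loc=1.0, scale=2.0, size=(18, 5))   # heterogeneous workers
byzantine = np.full((2, 5), -50.0)                       # arbitrary bad updates
grads = np.vstack([honest, byzantine])
print(bucketed_robust_aggregate(grads, bucket_size=2, rng=rng))
```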

28 citations


Journal ArticleDOI
TL;DR: The optimized Random Forest allows the samples with a higher probability of being rejected to be selected, thus improving the effectiveness of the quality control, and has made it possible to identify the most important features, which have been satisfactorily interpreted on a metallurgical basis.
Abstract: Non-metallic inclusions are unavoidably produced during steel casting resulting in lower mechanical strength and other detrimental effects. This study was aimed at developing a machine learning algorithm to classify castings of steel for tire reinforcement depending on the number and properties of inclusions, experimentally determined. 855 observations, obtained from the quality control of the steel, were available for training, validating and testing the algorithms. 140 parameters are monitored during fabrication, which are the features of the analysis; the output is 1 or 0 depending on whether the casting is rejected or not. The following algorithms have been employed: Logistic Regression, K-Nearest Neighbors, Support Vector Classifier (linear and RBF kernels), Random Forests, AdaBoost, Gradient Boosting and Artificial Neural Networks. The reduced value of the rejection rate implies that classification must be carried out on an imbalanced dataset. Resampling methods and specific scores for imbalanced datasets (recall, precision and AUC rather than accuracy) were used. Random Forest was the most successful algorithm providing an AUC in the test set of 0.85. No significant improvements were detected after resampling. The optimized Random Forest allows the samples with a higher probability of being rejected to be selected, thus improving the effectiveness of the quality control. In addition, the optimized Random Forest has made it possible to identify the most important features, which have been satisfactorily interpreted on a metallurgical basis.

20 citations


Journal ArticleDOI
01 Mar 2021
TL;DR: This tutorial explains the application of resampling techniques (including bootstrap) to the evaluation of neural networks’ performance from both a theoretical and practical point of view and presents a specific version of the bootstrap algorithm that allows the estimation of the distribution of a statistical estimator when dealing with an NNM in a computationally effective way.
Abstract: Neural networks present characteristics where the results are strongly dependent on the training data, the weight initialisation, and the hyperparameters chosen. The determination of the distribution of a statistical estimator, such as the Mean Squared Error (MSE) or the accuracy, is fundamental to evaluate the performance of a neural network model (NNM). For many machine learning models, such as linear regression, it is possible to analytically obtain information such as variance or confidence intervals on the results. Neural networks present the difficulty of not being analytically tractable due to their complexity. Therefore, it is impossible to easily estimate distributions of statistical estimators. When estimating the global performance of an NNM by estimating the MSE in a regression problem, for example, it is important to know the variance of the MSE. Bootstrap is one of the most important resampling techniques to estimate averages and variances, among other properties, of statistical estimators. In this tutorial, the application of resampling techniques (including bootstrap) to the evaluation of neural networks’ performance is explained from both a theoretical and practical point of view. The pseudo-code of the algorithms is provided to facilitate their implementation. Computational aspects, such as the training time, are discussed, since resampling techniques always require simulations to be run many thousands of times and, therefore, are computationally intensive. A specific version of the bootstrap algorithm is presented that allows the estimation of the distribution of a statistical estimator when dealing with an NNM in a computationally effective way. Finally, algorithms are compared on both synthetically generated and real data to demonstrate their performance.
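A plain-vanilla version of the idea is sketched below: bootstrap the training set, refit a small network each time, and read off the spread of the test MSE. The tutorial's more compute-efficient bootstrap variant is not reproduced, and the model and data are toy assumptions.

```python
# Sketch only: naive bootstrap estimate of the distribution of a model's test MSE.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

mses = []
for b in range(50):                                   # bootstrap replications
    idx = rng.integers(0, len(X_tr), size=len(X_tr))  # resample training set
    model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000,
                         random_state=b).fit(X_tr[idx], y_tr[idx])
    mses.append(mean_squared_error(y_te, model.predict(X_te)))

mses = np.array(mses)
print("MSE mean:", mses.mean(), "std:", mses.std(),
      "95% interval:", np.percentile(mses, [2.5, 97.5]))
```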

19 citations


Journal ArticleDOI
TL;DR: This paper explores using data resampling to reduce the negative effect of the class imbalance problem and improve the accuracy of learned models in deep-learning-based fault localization.
Abstract: Many fault localization approaches recently utilize deep learning to learn an effective localization model, showing a fresh perspective with promising results. However, localization models are generally learned from class imbalance datasets; that is, the number of failing test cases is much fewer than passing test cases. This imbalance can severely affect the accuracy of learned localization models. Thus, in this paper, we explore using data resampling to reduce the negative effect of the imbalanced class problem and improve the accuracy of learned models of deep‐learning‐based fault localization. Specifically, for deep‐learning‐based fault localization, its learning feature may require duplicate essential data to enhance the weak but beneficial experience incurred by the class imbalance datasets. We leverage the property of test cases (i.e., passing or failing) to identify failing test cases as the duplicate essential data and propose an iterative oversampling approach to resample failing test cases for producing a class balanced test suite. We apply the test case resampling to representative localization models using deep learning. Our empirical results on eight large‐sized programs with real faults and four large‐sized programs with seeded faults show that the test case resampling significantly improves fault localization effectiveness.
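A toy, one-shot version of the duplication idea is sketched below: failing test cases are duplicated until the suite is balanced. The coverage-matrix representation of test cases and the non-iterative duplication are simplifying assumptions, not the paper's iterative scheme.

```python
# Sketch only: oversample (duplicate) failing test cases until the test suite
# is class-balanced before training a localization model on it.
import numpy as np

rng = np.random.default_rng(0)
coverage = rng.integers(0, 2, size=(200, 30))      # tests x covered statements
labels = np.array([1] * 10 + [0] * 190)            # 1 = failing, 0 = passing

fail_idx = np.flatnonzero(labels == 1)
n_needed = np.sum(labels == 0) - len(fail_idx)     # duplicates required
dup_idx = rng.choice(fail_idx, size=n_needed, replace=True)

coverage_bal = np.vstack([coverage, coverage[dup_idx]])
labels_bal = np.concatenate([labels, np.ones(n_needed, dtype=int)])
print("balanced suite:", coverage_bal.shape,
      "failing fraction:", labels_bal.mean())
```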

18 citations


Journal ArticleDOI
TL;DR: In this article, a large number of LSTM-NNs are constructed by resampling time-series sequences that were obtained from the early-stage quantum evolution given by the numerically exact multilayer multiconfigurational time-dependent Hartree method.
Abstract: The recurrent neural network with the long short-term memory cell (LSTM-NN) is employed to simulate the long-time dynamics of open quantum systems. The bootstrap method is applied in the LSTM-NN construction and prediction, which provides a Monte Carlo estimation of a forecasting confidence interval. Within this approach, a large number of LSTM-NNs are constructed by resampling time-series sequences that were obtained from the early stage quantum evolution given by numerically exact multilayer multiconfigurational time-dependent Hartree method. The built LSTM-NN ensemble is used for the reliable propagation of the long-time quantum dynamics, and the simulated result is highly consistent with the exact evolution. The forecasting uncertainty that partially reflects the reliability of the LSTM-NN prediction is also given. This demonstrates the bootstrap-based LSTM-NN approach is a practical and powerful tool to propagate the long-time quantum dynamics of open systems with high accuracy and low computational cost.
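The sketch below illustrates the bootstrap-ensemble idea on a toy series, with a small feed-forward regressor on lagged windows standing in for the paper's LSTM networks; the data, model, and one-step-ahead setting are assumptions, not the paper's multilayer multiconfigurational time-dependent Hartree setup.

```python
# Sketch only: bootstrap an ensemble of regressors on resampled training
# windows and read off a percentile forecasting band.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
t = np.linspace(0, 20, 500)
series = np.exp(-0.05 * t) * np.cos(2 * t)            # toy damped oscillation

lag = 10
X = np.stack([series[i:i + lag] for i in range(len(series) - lag)])
y = series[lag:]
n_train = 350
X_tr, y_tr, X_te = X[:n_train], y[:n_train], X[n_train:]

preds = []
for b in range(30):                                    # bootstrap the sequences
    idx = rng.integers(0, n_train, size=n_train)
    model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=3000,
                         random_state=b).fit(X_tr[idx], y_tr[idx])
    preds.append(model.predict(X_te))

preds = np.array(preds)
mean_forecast = preds.mean(axis=0)
lower, upper = np.percentile(preds, [2.5, 97.5], axis=0)
print("first 3 forecasts:", mean_forecast[:3])
print("first 3 interval widths:", (upper - lower)[:3])
```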

18 citations


Journal ArticleDOI
TL;DR: In this article, different train/test split proportions are used with the following resampling methods: the bootstrap, the leave-one-out cross-validation, the tenfold cross validation, and the random repeated train and test split to test their performance on several classification methods.
Abstract: Background: The bootstrap can be an alternative to cross-validation as a training/test set splitting method since it minimizes the computing time in classification problems in comparison to the tenfold cross-validation. Objectives: This research investigates what proportion should be used to split the dataset into the training and the testing set so that the bootstrap might be competitive in terms of accuracy with other resampling methods. Methods/Approach: Different train/test split proportions are used with the following resampling methods: the bootstrap, the leave-one-out cross-validation, the tenfold cross-validation, and the random repeated train/test split to test their performance on several classification methods. The classification methods used include the logistic regression, the decision tree, and the k-nearest neighbours. Results: The findings suggest that using a different structure of the test set (e.g. 30/70, 20/80) can further optimize the performance of the bootstrap when applied to the logistic regression and the decision tree. For the k-nearest neighbour, the tenfold cross-validation with a 70/30 train/test splitting ratio is recommended. Conclusions: Depending on the characteristics and the preliminary transformations of the variables, the bootstrap can improve the accuracy of the classification problem.
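A minimal sketch contrasting a bootstrap (out-of-bag) accuracy estimate with tenfold cross-validation for a logistic regression on synthetic data; the split structure and models are illustrative, not the study's full design.

```python
# Sketch only: bootstrap out-of-bag accuracy versus tenfold cross-validation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rng = np.random.default_rng(0)

# Bootstrap: train on a resample drawn with replacement, test on the
# out-of-bag observations (roughly a 63/37 train/test structure).
boot_scores = []
for _ in range(100):
    idx = rng.integers(0, len(X), size=len(X))
    oob = np.setdiff1d(np.arange(len(X)), idx)
    clf = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    boot_scores.append(clf.score(X[oob], y[oob]))

cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
print("bootstrap OOB accuracy:", np.mean(boot_scores))
print("tenfold CV accuracy:   ", cv_scores.mean())
```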

Journal ArticleDOI
TL;DR: In this paper, the authors propose a new benchmark models comparison framework for imbalanced credit scoring, and experimentally compare the performance of 10 benchmark resampling methods and nine benchmark classification models on six credit scoring data sets, and analyze the optimal combinations of them.

Journal ArticleDOI
TL;DR: In this paper, the authors evaluate the accuracy of snapshot resampling of both population and community-level metrics under a variety of conditions, including interannual variability in the response variable is low or the magnitude of change through time is high.
Abstract: Historical data sets can be useful tools to aid in understanding the impacts of global change on natural ecosystems. Resampling of historically sampled sites (“snapshot resampling”) has often been used to detect long‐term shifts in ecological populations and communities, because it allows researchers to avoid long‐term monitoring costs and investigate a large number of potential trends. But recent simulation‐based research has called the reliability of resampling into question, and its utility has not been comprehensively evaluated. Here we combine long‐term empirical data sets with novel community‐level simulations to explore the accuracy of snapshot resampling of both population‐ and community‐level metrics under a variety of conditions. We show that snapshot resampling often yields spurious conclusions, but the accuracy of results increases when inter‐annual variability in the response variable is low or the magnitude of change through time is high. Snapshot resampling also generally performs better for community‐level metrics (e.g., species richness) as opposed to population‐level metrics pertaining to a single species (e.g., abundance). Finally, we evaluate strategies to improve the accuracy of snapshot resampling, including sampling multiple years at the end of the study, but these produce mixed results. Ultimately, we find that snapshot resampling should be used with caution, but under certain circumstances, can be useful for understanding long‐term global change impacts.

Journal ArticleDOI
TL;DR: In this article, the authors evaluate two general strategies for conducting covariate adjustment in Fisher's randomization tests (FRTs): the pseudo-outcome strategy, which uses the residuals from an outcome model with only the covariates as the pseudo, covariate-adjusted outcomes to form the test statistic, and the model-output strategy, which directly uses the output from a model with both the treatment and the covariates as the covariate-adjusted test statistic.

Journal ArticleDOI
TL;DR: In this article, the authors provide a framework for conducting valid inference for causal parameters without imposing strong variance or support restrictions on the propensity score, in particular for irregularly identified treatment effect parameters.

Journal ArticleDOI
TL;DR: In this paper, Zhang et al. propose an improved genetic optimization based resampling method, which optimizes the distribution of resampled particles through five operators: selection, roughening, classification, crossover, and mutation.
Abstract: In indoor target tracking based on wireless sensor networks, the particle filtering algorithm has been widely used because of its outstanding performance in coping with highly non-linear problems. Resampling is generally required to address the inherent particle degeneracy problem in the particle filter. However, traditional resampling methods cause the problem of particle impoverishment. This problem degrades positioning accuracy and robustness and sometimes may even result in filtering divergence and tracking failure. In order to mitigate the particle impoverishment and improve positioning accuracy, this paper proposes an improved genetic optimization based resampling method. This resampling method optimizes the distribution of resampled particles by the five operators, i.e., selection, roughening, classification, crossover, and mutation. The proposed resampling method is then integrated into the particle filtering framework to form a genetic optimization resampling based particle filtering (GORPF) algorithm. The performance of the GORPF algorithm is tested by a one-dimensional tracking simulation and a three-dimensional indoor tracking experiment. Both test results show that with the aid of the proposed resampling method, the GORPF has better robustness against particle impoverishment and achieves better positioning accuracy than several existing target tracking algorithms. Moreover, the GORPF algorithm owns an affordable computation load for real-time applications.
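For reference, the sketch below implements standard systematic resampling, the baseline step whose impoverishment problems the proposed genetic operators are designed to address; it is not the GORPF algorithm itself, and the particle state and weights are synthetic.

```python
# Sketch only: standard systematic resampling inside a particle filter.
import numpy as np

def systematic_resample(particles, weights, rng=None):
    """Resample particles proportionally to their weights."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(weights)
    positions = (rng.random() + np.arange(n)) / n       # stratified positions
    cumulative = np.cumsum(weights)
    cumulative[-1] = 1.0                                 # guard against round-off
    indices = np.searchsorted(cumulative, positions)
    return particles[indices], np.full(n, 1.0 / n)       # equal weights afterwards

rng = np.random.default_rng(0)
particles = rng.normal(size=(1000, 2))                   # e.g. 2-D positions
weights = rng.random(1000)
weights /= weights.sum()
new_particles, new_weights = systematic_resample(particles, weights, rng)
print(new_particles.shape, new_weights[:3])
```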

Journal ArticleDOI
TL;DR: A scheme is presented that spatially couples two gyrokinetic codes using first principles, with a five-dimensional (5D) grid used to communicate the distribution function between the two codes.
Abstract: We present a scheme that spatially couples two gyrokinetic codes using first principles. Coupled equations are presented and a necessary and sufficient condition for ensuring accuracy is derived. This new scheme couples both the field and the particle distribution function. The coupling of the distribution function is only performed once every few time-steps, using a five-dimensional (5D) grid to communicate the distribution function between the two codes. This 5D grid interface enables the coupling of different types of codes and models, such as particle and continuum codes, or delta-f and total-f models. Transferring information from the 5D grid to the marker particle weights is achieved using a new resampling technique. Demonstration of the coupling scheme is shown using two XGC gyrokinetic simulations for both the core and edge. We also apply the coupling scheme to two continuum simulations for a one-dimensional advection–diffusion problem.

Journal ArticleDOI
TL;DR: In this article, the performance of PLSe1 and PLSe2 with non-normal data based on a Monte Carlo simulation using a simple and a complex model has been evaluated under both normality and non-normality conditions.
Abstract: PLSe1 and PLSe2 methods were developed in 2013. While the performance of PLSe1 under normality and non-normality conditions has been confirmed, the performance of PLSe2, proposed to provide an avenue for the resurrection of PLS as a fully justified statistical methodology, has not yet been verified under non-normality conditions. For this reason, our study aims at testing the performance of PLSe2 with non-normal data based on a Monte Carlo simulation using a simple and a complex model. In addition, it aims at providing a step-by-step visual guideline on how to apply this method in estimating a simple mediation model using EQS 6.4. The results of the Monte Carlo simulations across different numbers of replications and sample sizes provided substantial support for the performance of PLSe2 under non-normality conditions since the produced estimates were unbiased and virtually identical to the parameters resulting from the traditional ML estimation. In addition, we provided evidence about the suitability of different robust test statistics for the purpose of model evaluation based on our simulation results. Regarding the empirical example, we estimated a mediation model using ML, PLSe2, and PLSc estimators, compared the results across these methods, and provided further support for our PLSe2 and ML results through running a resampling bootstrap simulation. Overall, while we empirically validated the PLSe2 method using Monte Carlo simulations, our findings suggest that PLSe2 has the advantages of both ML and PLS and performs well under non-normality (and normality) conditions, thereby suggesting it as the methodology of choice for model specification, estimation, and evaluation in social sciences empirical studies.

Journal ArticleDOI
TL;DR: In this paper, the effects of resampling techniques on the performance of both machine learning and classical statistical models for classifying and predicting different crash types on freeways were compared.

Journal ArticleDOI
TL;DR: An intelligent PF method with an adaptive Metropolis–Hastings (M-H) resampling strategy is proposed, offering accuracy improvements in liquid-level estimation during silicon crystal growth.
Abstract: During the growth of silicon single crystals, it is critical to detect the liquid level of the silicon melt to ensure their high-quality production. Because noise statistics are difficult to determine in measured values of the liquid level, a particle filter (PF) with unknown statistics has been presented to estimate the liquid level. However, this approach leads to inaccurate results due to sample impoverishment. To alleviate this problem, we propose an intelligent PF method with an adaptive Metropolis–Hastings (M-H) resampling strategy. To accomplish this, we first design an M-H resampling strategy with two proposed distributions to resample low-weight particles. These distributions randomly select high-weight particles for the Gaussian mutations or high-weight and low-weight particles for crossover operations, so as to promote the movement of low-weight particles to high-probability regions. We also construct a self-adaptive function to further improve the overall particle quality, which is used to calculate the selection probability of these two proposed distributions according to the proportion of low-weight particles in all of the particles. Finally, the liquid level is estimated according to the particles after the modified resampling strategy is applied. A comparative evaluation of the proposed method with the adaptive genetic particle filter (AGPF) and the firefly algorithm intelligence optimized particle filter (FAIOPF) is conducted. Some results of the simulation and the practical experiment are presented; they indicate the proposed method offers accuracy improvements in the liquid-level estimation during the silicon crystal growth. More specifically, compared with the AGPF and the FAIOPF, the mean absolute error (MAE) of the proposed method has been reduced by approximately 53.3% and 99.5%, respectively.

Journal ArticleDOI
TL;DR: In each group of closely related species (mostly congeneric), it is found that a species can be identified fairly accurately even when means are based on relatively small samples, although errors are frequent with fewer specimens and primates are more prone to inaccuracies.
Abstract: An accurate classification is the basis for research in biology. Morphometrics and morphospecies play an important role in modern taxonomy, with geometric morphometrics increasingly applied as a favourite analytical tool. Yet, really large samples are seldom available for modern species and even less common in palaeontology, where morphospecies are often identified, described and compared using just one or a very few specimens. The impact of sampling error and how large a sample must be to mitigate the inaccuracy are important questions for morphometrics and taxonomy. Using more than 4000 crania of adult mammals and taxa representing each of the four placental superorders, we assess the impacts of sampling error on estimates of species means, variances and covariances in Procrustes shape data using resampling experiments. In each group of closely related species (mostly congeneric), we found that a species can be identified fairly accurately even when means are based on relatively small samples, although errors are frequent with fewer specimens and primates are more prone to inaccuracies. A precise reconstruction of similarity relationships, in contrast, sometimes requires very large samples (> 100), but this varies widely depending on the study group. Medium-sized samples are necessary to accurately estimate standard errors of mean shapes or intraspecific variance-covariance structure, but in this case minimum sample sizes are broadly similar across all groups (≈ 20–50 individuals). Overall, thus, the minimum sample size required for a study varies across taxa and depends on what is being assessed, but about 25–40 specimens (for each sex, if a species is sexually dimorphic) may be on average an adequate and attainable minimum sample size for estimating the most commonly used shape parameters. As expected, the best predictor of the effects of sampling error is the ratio of between- to within-species variation: the larger the ratio, the smaller the sample size needed to obtain the same level of accuracy. Even though ours is the largest study to date of the uncertainties in estimates of means, variances and covariances in geometric morphometrics, and despite its generally high congruence with previous analyses, we feel it would be premature to generalize. Clearly, there is no a priori answer for what minimum sample size is required for a particular study and no universal recipe to control for sampling error. Exploratory analyses using resampling experiments are thus desirable, easy to perform and yield powerful preliminary clues about the effect of sampling on parameter estimates in comparative studies of morphospecies, and in a variety of other morphometric applications in biology and medicine. Morphospecies descriptions are indeed a small piece of provisional evidence in a much more complex evolutionary puzzle. However, they are crucial in palaeontology, and provide important complementary evidence in modern integrative taxonomy. Thus, if taxonomy provides the bricks for accurate research in biology, understanding the robustness of these bricks is the first fundamental step to build scientific knowledge on sound, stable and long-lasting foundations.
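The kind of exploratory resampling experiment recommended here can be sketched in a few lines: repeatedly subsample at several sample sizes and track the error of the estimated mean. The data below are synthetic stand-ins, not Procrustes shape coordinates.

```python
# Sketch only: rarefaction-style resampling to see how the error of an
# estimated multivariate mean shrinks as sample size grows.
import numpy as np

rng = np.random.default_rng(0)
pop = rng.normal(size=(4000, 40))          # stand-in for shape variables
true_mean = pop.mean(axis=0)

for n in (10, 25, 50, 100):
    errors = []
    for _ in range(500):                    # resampling replicates
        sample = pop[rng.choice(len(pop), size=n, replace=False)]
        errors.append(np.linalg.norm(sample.mean(axis=0) - true_mean))
    print(f"n = {n:3d}: mean error of estimated mean = {np.mean(errors):.3f}")
```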

Journal ArticleDOI
TL;DR: Two methods are proposed in this article to calculate the confidence intervals of the loading values of SPCA techniques, which lead to sparse structures of PCs; the methods are based on a resampling technique and an estimate of the error variance of the data.

Journal ArticleDOI
TL;DR: The objective is to obtain models for automatic selection of the best resampling strategy for any dataset based on its characteristics; the strategies are evaluated with eight performance metrics, including new measures specifically designed for imbalanced data classification.
Abstract: In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class. This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples. Thus, the prediction model is unreliable although the overall model accuracy can be acceptable. Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class. However, their effectiveness depends on several factors mainly related to data intrinsic characteristics, such as imbalance ratio, dataset size and dimensionality, overlapping between classes or borderline examples. In this work, the impact of these factors is analyzed through a comprehensive comparative study involving 40 datasets from different application areas. The objective is to obtain models for automatic selection of the best resampling strategy for any dataset based on its characteristics. These models allow us to check several factors simultaneously considering a wide range of values since they are induced from very varied datasets that cover a broad spectrum of conditions. This differs from most studies that focus on the individual analysis of the characteristics or cover a small range of values. In addition, the study encompasses both basic and advanced resampling strategies that are evaluated by means of eight different performance metrics, including new measures specifically designed for imbalanced data classification. The general nature of the proposal allows the choice of the most appropriate method regardless of the domain, avoiding the search for special purpose techniques that could be valid for the target data.

Journal ArticleDOI
TL;DR: The experimental evaluation with 23 hierarchical datasets across different domains and characteristics showed that the proposed resampling algorithms significantly improve the classification performance.

Journal ArticleDOI
TL;DR: In this paper, Zhang et al. extracted 7658 molecular structure features from 2354 drug molecule SMILES strings using computational methods, and used three machine learning feature selection algorithms to construct multiple discriminant models.
Abstract: Computer-aided drug design is an efficient method to analyze the development of disease-related drugs. However, developed as binding targets, medicines perform well in cell models and animal models but fail in human models. One main reason for this failure is that the human body has natural barriers, such as the blood-brain barrier, to block exogenous macromolecules. Thus, efficient and accurate prediction of drug molecules that can effectively pass the blood-brain barrier is necessary in developing drug treatments for brain tissue diseases. In this study, 7658 molecular structure features were extracted from 2354 drug molecule SMILES strings using computational methods. By integrating three feature selection algorithms of machine learning, 33 chemical structure features with significantly discriminant performance were screened out and used to construct multiple discriminant models. After a comprehensive comparison, the XGBoost model was selected as the final prediction model. After data preprocessing and parameter optimization, the model achieved 95% accuracy on the training set. To verify the model’s stability, we introduced an external data set, on which the model reached 96% accuracy. This study applies new resampling methods and machine learning algorithms, and adjusts the application of resampling methods to obtain new chemical features to construct machine learning predictors. The features may contribute to drug development that integrates biological analysis and machine learning algorithms.

Proceedings ArticleDOI
22 Mar 2021
TL;DR: In this paper, the authors present a new methodology for assessing the behavior of significance tests in typical ranking tasks and conclude that the Wilcoxon test is the most reliable test and thus IR practitioners should adopt it as the reference tool to assess differences between IR systems.
Abstract: Null Hypothesis Significance Testing (NHST) has been recurrently employed as the reference framework to assess the difference in performance between Information Retrieval (IR) systems. IR practitioners customarily apply significance tests, such as the t-test, the Wilcoxon Signed Rank test, the Permutation test, the Sign test or the Bootstrap test. However, the question of which of these tests is the most reliable in IR experimentation is still controversial. Different authors have tried to shed light on this issue, but their conclusions are not in agreement. In this paper, we present a new methodology for assessing the behavior of significance tests in typical ranking tasks. Our method creates models from the search systems and uses those models to simulate different inputs to the significance tests. With such an approach, we can control the experimental conditions and run experiments with full knowledge about the truth or falseness of the null hypothesis. Following our methodology, we computed a series of simulations that estimate the proportion of Type I and Type II errors made by different tests. Results conclusively suggest that the Wilcoxon test is the most reliable test and, thus, IR practitioners should adopt it as the reference tool to assess differences between IR systems.
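A small sketch of two of the tests compared (the paired Wilcoxon signed-rank test and a paired sign-flip permutation test) applied to simulated per-query score differences; the scores are synthetic, not actual retrieval runs, and the effect size is an arbitrary assumption.

```python
# Sketch only: paired significance tests on per-query scores of two IR systems.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
n_queries = 50
system_a = rng.beta(5, 3, size=n_queries)            # e.g. AP per query
system_b = np.clip(system_a + rng.normal(0.02, 0.05, n_queries), 0, 1)
diff = system_b - system_a

print("Wilcoxon p-value:", wilcoxon(system_b, system_a).pvalue)

# Paired permutation test: randomly flip the sign of each per-query difference.
observed = diff.mean()
flips = rng.choice([-1.0, 1.0], size=(10000, n_queries))
perm_means = (flips * diff).mean(axis=1)
p_perm = (np.sum(np.abs(perm_means) >= abs(observed)) + 1) / (len(perm_means) + 1)
print("Permutation p-value:", p_perm)
```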

Journal ArticleDOI
TL;DR: A dual-filtering based convolutional neural network is proposed to extract features directly from the images for resampling parameter estimation; it has better performance than state-of-the-art methods.
Abstract: Resampling detection is an important problem in image forensics. Several existing approaches have been proposed to solve it, but few of them focus on resampling parameter estimation. Especially, the estimation of downsampling scenarios is very challenging. In this paper, we propose a dual-filtering based convolutional neural network (CNN) to extract features directly from the images. First, we analyze the formulation of resampling parameter estimation and reformulate it as a multi-classification problem by regarding each resampling parameter as a distinct class. Then, we design a network structure based on the preprocessing operation to capture the specific resampling traces for classification. Two parallel filters with different highpass filters are deployed to the CNN architecture, which enlarges the resampling traces and makes it easier to achieve resampling parameter estimation. Next, the outputs of the two filters are concatenated by a “concat” layer. Finally, the experimental results demonstrate our proposed method is effective and has better performance than state-of-the-art methods in resampling parameter estimation.
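A toy re-imagining of the dual-filter idea in PyTorch is sketched below: two parallel fixed high-pass filters whose outputs are concatenated (the "concat" step) and passed to a small classifier over resampling-parameter classes. The kernels, layer sizes, and number of classes are assumptions, not the paper's architecture.

```python
# Sketch only: two parallel fixed high-pass filters feeding a small CNN
# classifier over resampling-parameter classes.
import torch
import torch.nn as nn

class DualFilterCNN(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        lap = torch.tensor([[0., -1., 0.], [-1., 4., -1.], [0., -1., 0.]])
        hp2 = torch.tensor([[-1., 2., -1.], [2., -4., 2.], [-1., 2., -1.]])
        self.f1 = nn.Conv2d(1, 1, 3, padding=1, bias=False)
        self.f2 = nn.Conv2d(1, 1, 3, padding=1, bias=False)
        self.f1.weight.data = lap.view(1, 1, 3, 3)
        self.f2.weight.data = hp2.view(1, 1, 3, 3)
        for p in (*self.f1.parameters(), *self.f2.parameters()):
            p.requires_grad_(False)                     # fixed preprocessing filters
        self.features = nn.Sequential(
            nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):                               # x: (N, 1, H, W)
        residual = torch.cat([self.f1(x), self.f2(x)], dim=1)  # "concat" step
        return self.classifier(self.features(residual).flatten(1))

logits = DualFilterCNN()(torch.randn(4, 1, 64, 64))
print(logits.shape)   # torch.Size([4, 10])
```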

Posted Content
TL;DR: A new method of online inference for a vector of parameters estimated by the Polyak-Ruppert averaging procedure of stochastic gradient descent (SGD) algorithms is developed, which is fully operational with online data and is rigorously underpinned by a functional central limit theorem.
Abstract: We develop a new method of online inference for a vector of parameters estimated by the Polyak-Ruppert averaging procedure of stochastic gradient descent (SGD) algorithms. We leverage insights from time series regression in econometrics and construct asymptotically pivotal statistics via random scaling. Our approach is fully operational with online data and is rigorously underpinned by a functional central limit theorem. Our proposed inference method has a couple of key advantages over the existing methods. First, the test statistic is computed in an online fashion with only SGD iterates and the critical values can be obtained without any resampling methods, thereby allowing for efficient implementation suitable for massive online data. Second, there is no need to estimate the asymptotic variance and our inference method is shown to be robust to changes in the tuning parameters for SGD algorithms in simulation experiments with synthetic data.
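A minimal sketch of the underlying estimator, Polyak-Ruppert averaging of SGD iterates computed fully online for a linear regression; the random-scaling pivotal statistic itself is not reproduced, and the step-size schedule is an illustrative assumption.

```python
# Sketch only: online SGD with a running Polyak-Ruppert average of the iterates.
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 20000
theta_true = np.array([1.0, -2.0, 0.5])

theta = np.zeros(d)          # current SGD iterate
theta_bar = np.zeros(d)      # running Polyak-Ruppert average
for t in range(1, n + 1):
    x = rng.normal(size=d)                       # stream one observation
    y = x @ theta_true + rng.normal()
    grad = (x @ theta - y) * x                   # squared-loss gradient
    theta -= 0.3 * t ** -0.6 * grad              # decaying step size
    theta_bar += (theta - theta_bar) / t         # online average update

print("averaged estimate:", theta_bar)
print("truth:            ", theta_true)
```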

Journal ArticleDOI
TL;DR: A comprehensive experimental methodology is proposed that combines exploratory and statistical validation stages, uses resampling techniques to minimize the sampling cost, and applies statistical significance tests to identify strengths and weaknesses of individual features.
Abstract: The inherent difficulty of solving a continuous, static, bound-constrained and single-objective black-box optimization problem depends on the characteristics of the problem’s fitness landscape and the algorithm being used. Exploratory landscape analysis (ELA) uses numerical features generated via a sampling process of the search space to describe such characteristics. Despite their success in a number of applications, these features have limitations related with the computational costs associated with generating accurate results. Consequently, only approximations are available in practice which may be unreliable, leading to systemic errors. The overarching aim of this paper is to evaluate the reliability of five well-known ELA feature sets across multiple dimensions and sample sizes. For this purpose, we propose a comprehensive experimental methodology combining exploratory and statistical validation stages, which uses resampling techniques to minimize the sampling cost, and statistical significance tests to identify strengths and weaknesses of individual features. The data resulting from the methodology is collected and made available in the LEarning and OPtimization Archive of Research Data v1.0. The results show that instances of the same function can have feature values that are significantly different; hence, non-generalizable across instances, due to the effects produced by the boundary constraints. In addition, some landscape features under evaluation are highly volatile, and strongly susceptible to changes in sample size. Finally, the results show evidence of a curse of modality, meaning that the sample size should increase with the number of local optima.

02 Mar 2021
TL;DR: This paper proposes a novel algorithm that exploits the benefits of the PMC framework and includes more efficient adaptive mechanisms, exploiting geometric information of the target distribution, and shows the successful performance of the proposed method in three numerical examples, involving challenging distributions.
Abstract: Adaptive importance sampling (AIS) methods are increasingly used for the approximation of distributions and related intractable integrals in the context of Bayesian inference. Population Monte Carlo (PMC) algorithms are a subclass of AIS methods, widely used due to their ease of adaptation. In this paper, we propose a novel algorithm that exploits the benefits of the PMC framework and includes more efficient adaptive mechanisms, exploiting geometric information of the target distribution. In particular, the novel algorithm adapts the location and scale parameters of a set of importance densities (proposals). At each iteration, the location parameters are adapted by combining a versatile resampling strategy (i.e., using the information of previous weighted samples) with an advanced optimization-based scheme. Local second-order information of the target distribution is incorporated through a preconditioning matrix acting as a scaling metric onto a gradient direction. A damped Newton approach is adopted to ensure robustness of the scheme. The resulting metric is also used to update the scale parameters of the proposals. We discuss several key theoretical foundations for the proposed approach. Finally, we show the successful performance of the proposed method in three numerical examples, involving challenging distributions.
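A baseline sketch of the vanilla PMC loop this paper builds on: draw from Gaussian proposals, compute deterministic-mixture importance weights, and adapt the proposal locations by resampling. The paper's Newton-type, gradient-based location and scale updates are omitted, and the bimodal target is a toy assumption.

```python
# Sketch only: a basic population Monte Carlo loop with location adaptation
# by resampling of the weighted samples.
import numpy as np
from scipy.stats import multivariate_normal

def log_target(x):
    """Toy bimodal target: mixture of two 2-D Gaussians."""
    p = 0.5 * multivariate_normal.pdf(x, mean=[-2, 0]) \
        + 0.5 * multivariate_normal.pdf(x, mean=[3, 3])
    return np.log(p + 1e-300)

rng = np.random.default_rng(0)
n_prop, n_per, sigma = 50, 20, 1.0
locations = rng.normal(scale=5.0, size=(n_prop, 2))      # initial proposal means

for it in range(20):
    samples = locations[:, None, :] + sigma * rng.normal(size=(n_prop, n_per, 2))
    samples = samples.reshape(-1, 2)
    log_prop = multivariate_normal.logpdf(
        samples[:, None, :] - locations[None, :, :], mean=[0, 0],
        cov=sigma ** 2 * np.eye(2))
    # Deterministic-mixture importance weights.
    log_w = log_target(samples) - (np.logaddexp.reduce(log_prop, axis=1)
                                   - np.log(n_prop))
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    # Adaptation: resample new proposal locations from the weighted samples.
    locations = samples[rng.choice(len(samples), size=n_prop, p=w)]

print("estimated target mean:", np.average(samples, axis=0, weights=w))
```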