Open accessJournal Article

# SMOTE-ENC: A Novel SMOTE-Based Method to Generate Synthetic Data for Nominal and Continuous Features

02 Mar 2021-Vol. 4, Iss: 1, pp 18
Abstract: Real-world datasets are heavily skewed where some classes are significantly outnumbered by the other classes. In these situations, machine learning algorithms fail to achieve substantial efficacy while predicting these underrepresented instances. To solve this problem, many variations of synthetic minority oversampling methods (SMOTE) have been proposed to balance datasets which deal with continuous features. However, for datasets with both nominal and continuous features, SMOTE-NC is the only SMOTE-based oversampling technique to balance the data. In this paper, we present a novel minority oversampling method, SMOTE-ENC (SMOTE—Encoded Nominal and Continuous), in which nominal features are encoded as numeric values and the difference between two such numeric values reflects the amount of change of association with the minority class. Our experiments show that classification models using the SMOTE-ENC method offer better prediction than models using SMOTE-NC when the dataset has a substantial number of nominal features and also when there is some association between the categorical features and the target class. Additionally, our proposed method addressed one of the major limitations of the SMOTE-NC algorithm. SMOTE-NC can be applied only on mixed datasets that have features consisting of both continuous and nominal features and cannot function if all the features of the dataset are nominal. Our novel method has been generalized to be applied to both mixed datasets and nominal-only datasets.

Topics: Categorical variable (51%), Synthetic data (50%)
##### Citations
More

5 results found

Open accessJournal Article
15 Mar 2021-
Abstract: An enormous amount of clinical free-text information, such as pathology reports, progress reports, clinical notes and discharge summaries have been collected at hospitals and medical care clinics. These data provide an opportunity of developing many useful machine learning applications if the data could be transferred into a learn-able structure with appropriate labels for supervised learning. The annotation of this data has to be performed by qualified clinical experts, hence, limiting the use of this data due to the high cost of annotation. An underutilised technique of machine learning that can label new data called active learning (AL) is a promising candidate to address the high cost of the label the data. AL has been successfully applied to labelling speech recognition and text classification, however, there is a lack of literature investigating its use for clinical purposes. We performed a comparative investigation of various AL techniques using ML and deep learning (DL)-based strategies on three unique biomedical datasets. We investigated random sampling (RS), least confidence (LC), informative diversity and density (IDD), margin and maximum representativeness-diversity (MRD) AL query strategies. Our experiments show that AL has the potential to significantly reducing the cost of manual labelling. Furthermore, pre-labelling performed using AL expediates the labelling process by reducing the time required for labelling.

Topics: , Supervised learning (54%),  ... show more

13 Citations

Open accessJournal Article
03 Aug 2021-IEEE Access
Abstract: Medical datasets are usually imbalanced, where negative cases severely outnumber positive cases. Therefore, it is essential to deal with this data skew problem when training machine learning algorithms. This study uses two representative lung cancer datasets, PLCO and NLST, with imbalance ratios (the proportion of samples in the majority class to those in the minority class) of 24.7 and 25.0, respectively, to predict lung cancer incidence. This research uses the performance of 23 class imbalance methods (resampling and hybrid systems) with three classical classifiers (logistic regression, random forest, and LinearSVC) to identify the best imbalance techniques suitable for medical datasets. Resampling includes ten under-sampling methods (RUS, etc.), seven over-sampling methods (SMOTE, etc.), and two integrated sampling methods (SMOTEENN, SMOTE-Tomek). Hybrid systems include (Balanced Bagging, etc.). The results show that class imbalance learning can improve the classification ability of the model. Compared with other imbalanced techniques, under-sampling techniques have the highest standard deviation (SD), and over-sampling techniques have the lowest SD. Over-sampling is a stable method, and the AUC in the model is generally higher than in other ways. Using ROS, the random forest performs the best predictive ability and is more suitable for the lung cancer datasets used in this study. The code is available at https://mkhushi.github.io/

Topics: Resampling (54%), Random forest (52%)

5 Citations

Open accessJournal Article
24 Jun 2021-Sensors
Abstract: Cybersecurity is an arms race, with both the security and the adversaries attempting to outsmart one another, coming up with new attacks, new ways to defend against those attacks, and again with new ways to circumvent those defences. This situation creates a constant need for novel, realistic cybersecurity datasets. This paper introduces the effects of using machine-learning-based intrusion detection methods in network traffic coming from a real-life architecture. The main contribution of this work is a dataset coming from a real-world, academic network. Real-life traffic was collected and, after performing a series of attacks, a dataset was assembled. The dataset contains 44 network features and an unbalanced distribution of classes. In this work, the capability of the dataset for formulating machine-learning-based models was experimentally evaluated. To investigate the stability of the obtained models, cross-validation was performed, and an array of detection metrics were reported. The gathered dataset is part of an effort to bring security against novel cyberthreats and was completed in the SIMARGL project.

Topics:

2 Citations

Open accessPosted Content
Luke Jordan1Institutions (1)
13 Oct 2021-arXiv: Learning
Abstract: The special and important problems of default prediction for municipal bonds are addressed using a combination of text embeddings from a pre-trained transformer network, a fully connected neural network, and synthetic oversampling. The combination of these techniques provides significant improvement in performance over human estimates, linear models, and boosted ensemble models, on data with extreme imbalance. Less than 0.2% of municipal bonds default, but our technique predicts 9 out of 10 defaults at the time of issue, without using bond ratings, at a cost of false positives on less than 0.1% non-defaulting bonds. The results hold the promise of reducing the cost of capital for local public goods, which are vital for society, and bring techniques previously used in personal credit and public equities (or national fixed income), as well as the current generation of embedding techniques, to sub-sovereign credit decisions.

Topics: Bond credit rating (55%), Fixed income (55%), Default (54%) ... show more

Open accessProceedings Article
Jackson Zhou1, Matloob Khushi2, Mohammad Ali Moni3, Shahadat Uddin2  +1 moreInstitutions (3)
01 Sep 2021-
Abstract: The high incidence and low survival rate of lung cancers contribute to their high death count, and drive the development of lung cancer prediction models using demographic factors. The five year relative survival rate of small cell lung cancer in particular (6%) is four times less than that of non small cell lung cancer (23%), though no predictive models have been developed for it so far. This study aimed to expand on previous lung cancer prediction studies and develop improved models for general and small cell lung cancer prediction. Established machine learning models were considered, in addition to a novel curriculum learning based deep neural network. All models were evaluated using data from the National Cancer Institute's Prostate, Lung, Colorectal and Ovarian Cancer screening trial, with performance measured using the area under the receiver operator characteristic curve (AUROC). Random forest models were found to give the best performances in lung cancer prediction (bootstrap optimism corrected (BOC) $\text{AUROC}\ = {0.927}$ ), outperforming previous logistic regression models $(\text{BOC} \text{AUROC} ={0.859})$ . Additionally, curriculum learning based neural networks were shown to outperform all other model types for small cell lung cancer prediction in particular (AUROCs of 0.873 and 0.882 across two feature sets). To conclude, high-performance models were developed for general and small cell lung cancer prediction, and could help improve non-invasive lung cancer prediction in a clinical setting.

Topics: Cancer (54%), Lung cancer (53%)
##### References
More

26 results found

Open accessJournal ArticleDOI: 10.1613/JAIR.953
Abstract: An approach to the construction of classifiers from imbalanced datasets is described. A dataset is imbalanced if the classification categories are not approximately equally represented. Often real-world data sets are predominately composed of "normal" examples with only a small percentage of "abnormal" or "interesting" examples. It is also the case that the cost of misclassifying an abnormal (interesting) example as a normal example is often much higher than the cost of the reverse error. Under-sampling of the majority (normal) class has been proposed as a good means of increasing the sensitivity of a classifier to the minority class. This paper shows that a combination of our method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class. This paper also shows that a combination of our method of over-sampling the minority class and under-sampling the majority class can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or class priors in Naive Bayes. Our method of over-sampling the minority class involves creating synthetic minority class examples. Experiments are performed using C4.5, Ripper and a Naive Bayes classifier. The method is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.

Topics: ,

11,512 Citations

Open accessJournal ArticleDOI: 10.1613/JAIR.953
Abstract: An approach to the construction of classifiers from imbalanced datasets is described. A dataset is imbalanced if the classification categories are not approximately equally represented. Often real-world data sets are predominately composed of "normal" examples with only a small percentage of "abnormal" or "interesting" examples. It is also the case that the cost of misclassifying an abnormal (interesting) example as a normal example is often much higher than the cost of the reverse error. Under-sampling of the majority (normal) class has been proposed as a good means of increasing the sensitivity of a classifier to the minority class. This paper shows that a combination of our method of oversampling the minority (abnormal)cla ss and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space)tha n only under-sampling the majority class. This paper also shows that a combination of our method of over-sampling the minority class and under-sampling the majority class can achieve better classifier performance (in ROC space)t han varying the loss ratios in Ripper or class priors in Naive Bayes. Our method of over-sampling the minority class involves creating synthetic minority class examples. Experiments are performed using C4.5, Ripper and a Naive Bayes classifier. The method is evaluated using the area under the Receiver Operating Characteristic curve (AUC)and the ROC convex hull strategy.

Topics:

11,077 Citations

Journal Article
Abstract: There are several aspects that might influence the performance achieved by existing learning systems. It has been reported that one of these aspects is related to class imbalance in which examples in training data belonging to one class heavily outnumber the examples in the other class. In this situation, which is found in real world data describing an infrequent but important event, the learning system may have difficulties to learn the concept related to the minority class. In this work we perform a broad experimental evaluation involving ten methods, three of them proposed by the authors, to deal with the class imbalance problem in thirteen UCI data sets. Our experiments provide evidence that class imbalance does not systematically hinder the performance of learning systems. In fact, the problem seems to be related to learning with too few minority class examples in the presence of other complicating factors, such as class overlapping. Two of our proposed methods deal with these conditions directly, allying a known over-sampling method with data cleaning methods in order to produce better-defined class clusters. Our comparative experiments show that, in general, over-sampling methods provide more accurate results than under-sampling methods considering the area under the ROC curve (AUC). This result seems to contradict results previously published in the literature. Two of our proposed methods, Smote + Tomek and Smote + ENN, presented very good results for data sets with a small number of positive examples. Moreover, Random over-sampling, a very simple over-sampling method, is very competitive to more complex over-sampling methods. Since the over-sampling methods provided very good performance results, we also measured the syntactic complexity of the decision trees induced from over-sampled data. Our results show that these trees are usually more complex then the ones induced from original data. Random over-sampling usually produced the smallest increase in the mean number of induced rules and Smote + ENN the smallest increase in the mean number of conditions per rule, when compared among the investigated over-sampling methods.

Topics:

2,265 Citations

Book Chapter
Hui Han1, Wenyuan Wang1, Binghuan Mao2Institutions (2)
23 Aug 2005-
Abstract: In recent years, mining with imbalanced data sets receives more and more attentions in both theoretical and practical aspects. This paper introduces the importance of imbalanced data sets and their broad application domains in data mining, and then summarizes the evaluation metrics and the existing methods to evaluate and solve the imbalance problem. Synthetic minority over-sampling technique (SMOTE) is one of the over-sampling methods addressing this problem. Based on SMOTE method, this paper presents two new minority over-sampling methods, borderline-SMOTE1 and borderline-SMOTE2, in which only the minority examples near the borderline are over-sampled. For the minority class, experiments show that our approaches achieve better TP rate and F-value than SMOTE and random over-sampling methods.

1,894 Citations

Open accessProceedings Article
Haibo He1, Yang Bai1, E.A. Garcia1, Shutao Li2Institutions (2)
01 Jun 2008-
Abstract: This paper presents a novel adaptive synthetic (ADASYN) sampling approach for learning from imbalanced data sets. The essential idea of ADASYN is to use a weighted distribution for different minority class examples according to their level of difficulty in learning, where more synthetic data is generated for minority class examples that are harder to learn compared to those minority examples that are easier to learn. As a result, the ADASYN approach improves learning with respect to the data distributions in two ways: (1) reducing the bias introduced by the class imbalance, and (2) adaptively shifting the classification decision boundary toward the difficult examples. Simulation analyses on several machine learning data sets show the effectiveness of this method across five evaluation metrics.