A Comparative Performance Analysis of Data Resampling Methods on Imbalance Medical Data

doi:10.1109/ACCESS.2021.3102399

Open Access Journal Article

Matloob Khushi, +7 more

- 03 Aug 2021 -

IEEE Access

- Vol. 9, pp 109960-109975

TLDR

This article compares the performance of 23 class imbalance methods (resampling and hybrid systems) paired with three classical classifiers (logistic regression, random forest, and LinearSVC) to identify the imbalance techniques best suited to medical datasets.

Abstract:

Medical datasets are usually imbalanced: negative cases severely outnumber positive cases. It is therefore essential to deal with this data skew when training machine learning algorithms. This study uses two representative lung cancer datasets, PLCO and NLST, with imbalance ratios (the number of samples in the majority class divided by the number in the minority class) of 24.7 and 25.0, respectively, to predict lung cancer incidence. It compares the performance of 23 class imbalance methods (resampling and hybrid systems) paired with three classical classifiers (logistic regression, random forest, and LinearSVC) to identify the imbalance techniques best suited to medical datasets. The resampling methods comprise ten under-sampling methods (RUS, etc.), seven over-sampling methods (SMOTE, etc.), and two integrated sampling methods (SMOTEENN and SMOTE-Tomek). The hybrid systems include Balanced Bagging, among others. The results show that class imbalance learning can improve the classification ability of the model. Compared with the other class imbalance techniques, under-sampling techniques have the highest standard deviation (SD) and over-sampling techniques have the lowest. Over-sampling is therefore the more stable approach, and the AUC it yields is generally higher than that of the other approaches. Random forest combined with ROS achieves the best predictive ability and is the most suitable choice for the lung cancer datasets used in this study. The code is available at https://mkhushi.github.io/
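To make the terms in the abstract concrete, the following is a minimal sketch of the imbalance ratio and of the two simplest resampling methods it names, random over-sampling (ROS) and random under-sampling (RUS). It is not the authors' code. The function names, the toy data, and the `seed` parameter are assumptions made for illustration, and only the Python standard library is used.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def random_over_sample(samples, labels, seed=0):
    """ROS: duplicate randomly chosen samples of each smaller class
    (with replacement) until every class matches the majority size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [x for x, y in zip(samples, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_x.extend(extra)
        out_y.extend([cls] * len(extra))
    return out_x, out_y


def random_under_sample(samples, labels, seed=0):
    """RUS: keep a random subset of each class (without replacement)
    so that every class matches the minority size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_x, out_y = [], []
    for cls in counts:
        pool = [x for x, y in zip(samples, labels) if y == cls]
        out_x.extend(rng.sample(pool, target))
        out_y.extend([cls] * target)
    return out_x, out_y


# Hypothetical toy data: 250 negatives and 10 positives,
# chosen so the imbalance ratio is 25.0.
X = [[float(i)] for i in range(260)]
y = [0] * 250 + [1] * 10

X_ros, y_ros = random_over_sample(X, y)
X_rus, y_rus = random_under_sample(X, y)

print(imbalance_ratio(y))      # ratio before resampling
print(Counter(y_ros))          # class counts after ROS
print(Counter(y_rus))          # class counts after RUS
```

In practice, resampling of this kind is normally applied to the training split only, so that the evaluation data keeps the original class distribution. RUS discards majority samples at random, so its output varies more from run to run than that of ROS.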
