Journal Article · DOI

Handling Class Overlapping to Detect Noisy Instances in Classification

TLDR
Using four noise filters to identify the noisy instances among the overlapped instances, it is found that class overlap is a principal contributor to class noise in data sets.
Abstract
Automated classification plays a vital role in machine learning and data mining. Each classifier is likely to work well on some data sets and poorly on others, which increases the importance of evaluation. The performance of a learning model depends strongly on the characteristics of the data set. Previous results suggest that overlap between classes and the presence of noise have the strongest impact on the performance of learning algorithms. Class overlap is a critical problem in which data samples appear as valid instances of more than one class, and it may be responsible for the presence of noise in data sets. The objective of this paper is to better understand the data used in machine learning problems and to analyze the instances that are heavily overlapped, using newly proposed overlap measures: Nearest Enemy Ratio, SubConcept Ratio, Likelihood Ratio, and Soft Margin Ratio. For the experiments, we created 438 binary classification data sets from real-world problems and computed 12 data complexity metrics to find highly overlapped data sets. We then applied the proposed measures to identify the overlapped instances and four noise filters to find the noisy instances. The results show that 60–80% of the overlapped instances are identified as noisy by the four noise filters. We conclude that class overlap is a principal contributor to class noise in data sets.
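The first of the proposed measures, the Nearest Enemy Ratio, can be illustrated with a minimal sketch: for each instance, compare the distance to its nearest same-class neighbour with the distance to its nearest enemy (the closest instance of a different class). The paper's exact formulation may differ; the definition below is an illustrative assumption.

```python
import numpy as np

def nearest_enemy_ratio(X, y):
    """For each instance, the distance to its nearest same-class neighbour
    divided by the distance to its nearest enemy (closest instance of a
    different class). Ratios >= 1 suggest the instance lies in a class
    overlap region. Illustrative definition, not the paper's exact one."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # pairwise Euclidean distances; the diagonal is set to infinity so an
    # instance is never counted as its own neighbour
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    ratios = np.empty(len(X))
    for i in range(len(X)):
        friend = d[i, y == y[i]].min()  # nearest same-class neighbour
        enemy = d[i, y != y[i]].min()   # nearest other-class instance
        ratios[i] = friend / enemy
    return ratios
```

Instances whose nearest enemy is closer than their nearest same-class neighbour (ratio above 1) would be flagged as overlapped under this reading of the measure.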


Citations
Journal Article · DOI

A novel progressively undersampling method based on the density peaks sequence for imbalanced data

TL;DR: Zhang et al. propose a novel under-sampling method for imbalanced data that exploits a sequence of density peaks to progressively extract instances from the majority class.
Journal Article · DOI

A New Under-Sampling Method to Face Class Overlap and Imbalance

TL;DR: A two-stage under-sampling technique that combines the DBSCAN clustering algorithm, to remove noisy samples and clean the decision boundary, with a minimum spanning tree algorithm, to address the class imbalance, thus handling class overlap and imbalance simultaneously with the aim of improving classifier performance.
Journal Article · DOI

A comparative study on online machine learning techniques for network traffic streams analysis

TL;DR: The authors investigate and compare the online learning (OL) techniques that facilitate data stream analytics in the networking domain, highlight the advantages of online learning in this regard, and discuss the challenges associated with OL-based network traffic stream analysis, e.g., concept drift and class imbalance.
Journal Article · DOI

Empirical Comparisons for Combining Balancing and Feature Selection Strategies for Characterizing Football Players Using FIFA Video Game System

TL;DR: This article presents a large-scale comparison of approaches for characterizing football players into nine positions using FIFA video game data, whereas most previous studies in this field have characterized players into only three position classes.
Journal Article · DOI

An empirical study of data intrinsic characteristics that make learning from imbalanced data difficult

TL;DR: A contemporary empirical study of the behaviour and performance of five well-known classifiers on a large number of imbalanced datasets exhibiting numerous combinations of intrinsic data characteristics, which identifies and ranks the difficulty factors in learning from imbalanced data depending on the type of classification algorithm used.
References
Journal Article · DOI

Support-Vector Networks

TL;DR: High generalization ability of support-vector networks utilizing polynomial input transformations is demonstrated, and the performance of the support-vector network is compared to various classical learning algorithms that all took part in a benchmark study of Optical Character Recognition.
Book

C4.5: Programs for Machine Learning

TL;DR: A complete guide to the C4.5 system as implemented in C for the UNIX environment, which starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and overfitting.
Journal Article · DOI

Nearest neighbor pattern classification

TL;DR: The nearest neighbor decision rule assigns to an unclassified sample point the classification of the nearest of a set of previously classified points, so it may be said that half the classification information in an infinite sample set is contained in the nearest neighbor.
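The decision rule summarized above is simple enough to state in a few lines of code; below is a minimal 1-NN sketch over Euclidean distance (the function name is illustrative, not from the paper).

```python
import numpy as np

def nn_classify(X_train, y_train, x):
    """Nearest neighbour rule: assign an unclassified point the label of
    the closest previously classified point (Euclidean distance)."""
    X_train = np.asarray(X_train, dtype=float)
    dists = np.linalg.norm(X_train - np.asarray(x, dtype=float), axis=1)
    return y_train[int(np.argmin(dists))]
```

Despite its simplicity, Cover and Hart showed this rule's asymptotic error is at most twice the Bayes error, which is the sense in which "half the classification information" lies in the nearest neighbor.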
Journal Article · DOI

Statistical pattern recognition: a review

TL;DR: The objective of this review paper is to summarize and compare some of the well-known methods used in various stages of a pattern recognition system and identify research topics and applications which are at the forefront of this exciting and challenging field.
Journal Article · DOI

Learning from Imbalanced Data

TL;DR: A critical review of the nature of the problem, the state-of-the-art technologies, and the current assessment metrics used to evaluate learning performance under the imbalanced learning scenario is provided.