Book Chapter

A Comparative Analysis on Ensemble Classifiers for Concept Drifting Data Streams

TL;DR: This chapter surveys ensemble classifiers for data stream mining and compares their accuracy, memory use, and running time on synthetic and real datasets under different drift scenarios.
Abstract: Data stream mining plays a vital role in Big Data analytics. Traffic management, sensor networks and monitoring, and weblog analysis are examples of dynamic environments that generate streaming data. In such environments, data arrives at high speed, and the algorithms that process it must work within limited memory, limited computation time, and a single scan of the incoming data. A significant challenge in data stream mining is that the data distribution changes over time, which is called concept drift. A learning model therefore needs to detect these changes and adapt to them. Ensemble classifiers by nature adapt well to change and handle concept drift well. Three ensemble-based approaches are used to handle concept drift: online, block-based, and hybrid. We provide a survey of ensemble classifiers for learning in data stream mining. Finally, we compare their accuracy, memory use, and running time on synthetic and real datasets with different drift scenarios.
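To make the block-based idea concrete, here is a minimal sketch in Python. It is not an algorithm taken from the chapter: the class names `LookupLearner` and `BlockEnsemble`, the toy base learner, and the choice to weight each member by its accuracy on the newest block are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


class LookupLearner:
    """Toy base learner (hypothetical): most frequent label per input value."""

    def __init__(self, block):
        counts = defaultdict(Counter)
        for x, y in block:
            counts[x][y] += 1
        self.table = {x: c.most_common(1)[0][0] for x, c in counts.items()}
        # Fallback for inputs not seen in the training block.
        self.default = Counter(y for _, y in block).most_common(1)[0][0]

    def predict(self, x):
        return self.table.get(x, self.default)


def accuracy(model, block):
    """Fraction of examples in the block that the model labels correctly."""
    return sum(model.predict(x) == y for x, y in block) / len(block)


class BlockEnsemble:
    """Block-based ensemble sketch: one new member per block of examples."""

    def __init__(self, max_members=3):
        self.max_members = max_members
        self.members = []  # list of [model, weight] pairs

    def process_block(self, block):
        # Re-weight existing members by accuracy on the newest block
        # (an assumed weighting rule, chosen for simplicity).
        for member in self.members:
            member[1] = accuracy(member[0], block)
        # Train a new member on the newest block with full weight (assumption).
        self.members.append([LookupLearner(block), 1.0])
        # Keep only the highest-weighted members.
        self.members.sort(key=lambda m: m[1], reverse=True)
        del self.members[self.max_members:]

    def predict(self, x):
        if not self.members:
            return None
        votes = Counter()
        for model, weight in self.members:
            votes[model.predict(x)] += weight
        return votes.most_common(1)[0][0]
```

Feeding the ensemble one block at a time, for example `ens.process_block([("a", 1), ("b", 0)] * 5)`, lets members trained before a drift lose weight as soon as a block from the new concept arrives. An online ensemble would instead update its members after every single example.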
Citations
Book Chapter
01 Jan 2021
TL;DR: In this article, the authors discuss the importance of preprocessing big data in terms of analysis time, percentage of resources utilized, storage, efficiency of the analyzed data, and the information gained as output.
Abstract: Big data is a trending term in industry and academia that describes the huge flood of collected data, which is very complex in nature. The term is used to describe many data-related concepts, in both a technological and a cultural sense. In the big data community, big data analytics is used to discover the hidden patterns and values that give an accurate representation of the data. Big data preprocessing is considered an important step in the analysis process. It is key to the success of the analysis in terms of analysis time, percentage of resources utilized, storage, efficiency of the analyzed data, and the information gained as output. Preprocessing involves dealing with concepts such as concept drift and data streams, which are considered significant challenges.

7 citations

Journal Article
TL;DR: In this article, a hybrid block-based ensemble framework is proposed for multi-class classification in evolving data streams. It integrates the main advantages of an online drift detector for a k-class problem with block-based weighting in order to react to different types of drift.
Abstract: Data stream mining is an important research topic that has received increasing attention due to its use in a wide range of applications, such as sensor networks, banking, and telecommunication. The phenomenon of data streams evolving over time is known as concept drift. In addition, the presence of multiple classes aggravates the problem of a loss in performance during the process of drift detection in data streams. Several drift detectors and ensemble approaches have been widely employed; however, they either incur a high cost in memory consumption and run time or, in the case of ensemble approaches, may respond slowly because they train classifiers on outdated blocks. Motivated by this, we propose a hybrid block-based ensemble, which is a framework for multi-class classification in evolving data streams. The multi-class framework aims to integrate the main advantages of an online drift detector for a k-class problem with block-based weighting in order to react to different types of drift. Experimental evaluations on well-known synthetic and real-world datasets, through a comprehensive comparison against eleven drift detectors and five ensemble approaches, show that our proposed algorithms perform significantly better than the other drift detectors and ensemble approaches.

4 citations

Book Chapter
01 Jan 2021
TL;DR: This paper discusses data stream classification algorithms and simulates them on real and synthetic datasets to understand their performance parameters.
Abstract: Data stream mining has emerged as a new field of research in the past few years. It has gained a lot of attention recently due to challenging characteristics such as its dynamic nature, huge data size, continuous flow, and temporal character. Processing and classifying this type of data raises many issues in terms of both storage and analysis. Moreover, traditional classification algorithms do not fit data streams well, as they operate on data that is stored in memory once and for all. Data streams, if mined, can yield crucial information about the non-stationary system that generates them. Storing data streams is also not feasible, as storage cost increases with data size. An algorithm designed for data streams should therefore support an incremental and multi-pass approach, so that it can handle new data and analyze existing data at the same time. Data stream classification aims at labeling data, which is nearly impossible to do in real life because of the challenging characteristics of the data. A traditional data mining algorithm fits a limited number of instances, and such a model would not work with a data stream. In this paper, we focus on discussing data stream classification algorithms and simulate them on real and synthetic datasets to understand the performance parameters of the algorithms discussed.

3 citations

Journal Article
TL;DR: This survey explores, summarizes, and categorizes work within the domain of stream classification and identifies core research threads over the past few years, which are structured based on the stream classification process to facilitate coordination within this complex topic.
Abstract: Due to the rise of continuous data-generating applications, analyzing data streams has gained increasing attention over the past decades. A core research area in stream data is stream classification, which categorizes or detects data points within an evolving stream of observations. Areas of stream classification are diverse—ranging, e.g., from monitoring sensor data to analyzing a wide range of (social) media applications. Research in stream classification is related to developing methods that adapt to the changing and potentially volatile data stream. It focuses on individual aspects of the stream classification pipeline, e.g., designing suitable algorithm architectures, an efficient train and test procedure, or detecting so-called concept drifts. As a result of the many different research questions and strands, the field is challenging to grasp, especially for beginners. This survey explores, summarizes, and categorizes work within the domain of stream classification and identifies core research threads over the past few years. It is structured based on the stream classification process to facilitate coordination within this complex topic, including common application scenarios and benchmarking data sets. Thus, both newcomers to the field and experts who want to widen their scope can gain (additional) insight into this research area and find starting points and pointers to more in-depth literature on specific issues and research directions in the field.

1 citation

References

Book
01 Jan 2005
TL;DR: This book first surveys, then provides comprehensive yet concise algorithmic descriptions of methods, including classic methods plus the extensions and novel methods developed recently.
Abstract: This book organizes key concepts, theories, standards, methodologies, trends, challenges and applications of data mining and knowledge discovery in databases. It first surveys, then provides comprehensive yet concise algorithmic descriptions of methods, including classic methods plus the extensions and novel methods developed recently. It also gives in-depth descriptions of data mining applications in various interdisciplinary industries.

2,836 citations

Journal Article
TL;DR: The survey covers the different facets of concept drift in an integrated way to reflect on the existing scattered state of the art and aims at providing a comprehensive introduction to the concept drift adaptation for researchers, industry analysts, and practitioners.
Abstract: Concept drift primarily refers to an online supervised learning scenario when the relation between the input data and the target variable changes over time. Assuming a general knowledge of supervised learning in this article, we characterize adaptive learning processes; categorize existing strategies for handling concept drift; overview the most representative, distinct, and popular techniques and algorithms; discuss evaluation methodology of adaptive algorithms; and present a set of illustrative applications. The survey covers the different facets of concept drift in an integrated way to reflect on the existing scattered state of the art. Thus, it aims at providing a comprehensive introduction to the concept drift adaptation for researchers, industry analysts, and practitioners.

2,374 citations

Journal Article
TL;DR: A simple and effective method, based on weighted voting, is introduced for constructing a compound algorithm, which is robust in the presence of errors in the data, and is called the Weighted Majority Algorithm.
Abstract: We study the construction of prediction algorithms in a situation in which a learner faces a sequence of trials, with a prediction to be made in each, and the goal of the learner is to make few mistakes. We are interested in the case where the learner has reason to believe that one of some pool of known algorithms will perform well, but the learner does not know which one. A simple and effective method, based on weighted voting, is introduced for constructing a compound algorithm in such a circumstance. We call this method the Weighted Majority Algorithm. We show that this algorithm is robust in the presence of errors in the data. We discuss various versions of the Weighted Majority Algorithm and prove mistake bounds for them that are closely related to the mistake bounds of the best algorithms of the pool. For example, given a sequence of trials, if there is an algorithm in the pool A that makes at most m mistakes, then the Weighted Majority Algorithm will make at most c(log |A| + m) mistakes on that sequence, where c is a fixed constant.
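The weighted-voting scheme described in this abstract can be sketched in a few lines of Python. This is a minimal illustration of the basic variant for binary predictions, not code from the paper: the function name `weighted_majority`, the default penalty `beta = 0.5`, and the rule of breaking ties in favor of 1 are assumptions.

```python
from typing import List, Sequence, Tuple


def weighted_majority(
    expert_preds: Sequence[Sequence[int]],
    labels: Sequence[int],
    beta: float = 0.5,
) -> Tuple[int, List[float]]:
    """Run weighted-majority voting over a sequence of binary trials.

    expert_preds[t][i] is the 0/1 prediction of pool member i on trial t.
    Returns the number of mistakes of the combined predictor and the
    final weights of the pool members.
    """
    weights = [1.0] * len(expert_preds[0])
    mistakes = 0
    for preds, y in zip(expert_preds, labels):
        # Total weight behind each possible prediction.
        vote_1 = sum(w for w, p in zip(weights, preds) if p == 1)
        vote_0 = sum(w for w, p in zip(weights, preds) if p == 0)
        guess = 1 if vote_1 >= vote_0 else 0  # ties go to 1 (assumption)
        if guess != y:
            mistakes += 1
        # Multiply the weight of every member that was wrong by beta.
        weights = [w * beta if p != y else w for w, p in zip(weights, preds)]
    return mistakes, weights
```

Because a member's weight shrinks by a factor of `beta` each time it errs, a member that is consistently right soon dominates the vote, which is the intuition behind the mistake bound quoted in the abstract.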

2,093 citations

Journal Article
TL;DR: A family of learning algorithms is described that flexibly react to concept drift and can take advantage of situations where contexts reappear, controlled by a heuristic that constantly monitors the system's behavior.
Abstract: On-line learning in domains where the target concept depends on some hidden context poses serious problems. A changing context can induce changes in the target concepts, producing what is known as concept drift. We describe a family of learning algorithms that flexibly react to concept drift and can take advantage of situations where contexts reappear. The general approach underlying all these algorithms consists of (1) keeping only a window of currently trusted examples and hypotheses; (2) storing concept descriptions and reusing them when a previous context re-appears; and (3) controlling both of these functions by a heuristic that constantly monitors the system's behavior. The paper reports on experiments that test the systems' performance under various conditions such as different levels of noise and different extent and rate of concept drift.
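Point (1), keeping only a window of currently trusted examples, can be illustrated with a small Python sketch. It covers only a fixed-size window. The stored concept descriptions in (2) and the monitoring heuristic in (3) are not reproduced, and the class name `SlidingWindowClassifier` and its majority-vote prediction rule are assumptions made for this example.

```python
from collections import Counter, deque
from typing import Deque, Hashable, Optional, Tuple


class SlidingWindowClassifier:
    """Predicts from the most recent examples only (fixed-size window)."""

    def __init__(self, window_size: int = 50) -> None:
        # deque(maxlen=...) silently discards the oldest example when full.
        self.window: Deque[Tuple[Hashable, Hashable]] = deque(maxlen=window_size)

    def predict(self, x: Hashable) -> Optional[Hashable]:
        # Prefer labels of windowed examples with the same input value;
        # otherwise fall back to all labels currently in the window.
        matching = [y for xi, y in self.window if xi == x]
        pool = matching or [y for _, y in self.window]
        if not pool:
            return None
        return Counter(pool).most_common(1)[0][0]

    def learn(self, x: Hashable, y: Hashable) -> None:
        self.window.append((x, y))
```

With a small window, predictions follow a drift quickly because outdated examples are forgotten after a few updates. A large window is more stable under noise but slower to adapt, which is the trade-off an adaptive window heuristic tries to manage.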

1,614 citations