Book ChapterDOI

Empirical Analysis of Classification Algorithms in Data Stream Mining

01 Jan 2021 - pp 657-669
TL;DR: This paper discusses data stream classification algorithms and simulates them on real and synthetic datasets to understand their performance parameters.
Abstract: Data stream mining has emerged as a new field of research over the past few years. It has recently gained a lot of attention because of its challenging characteristics, such as its dynamic nature, huge data size, continuous flow, and temporal aspects. Processing and classifying such data raises many issues in terms of both storage and analysis. Moreover, traditional classification algorithms do not fit data streams well, because they process data that is stored in memory once and for all. When mined, data streams can yield crucial information about the non-stationary system that generates them. Storing data streams is also not feasible, because storage cost grows with data size. An algorithm designed for data streams should therefore support an incremental and multi-pass approach, so that it can deal with new data and analyze existing data at the same time. Data stream classification aims at labeling data, which is nearly impossible in practice because the characteristics of the data act as challenges. A traditional data mining algorithm fits a limited number of instances, and such a model would not work with a data stream. In this paper, we discuss data stream classification algorithms and simulate them on real and synthetic datasets to understand the performance parameters of the discussed algorithms.
Citations
Journal ArticleDOI
TL;DR: Bayes net (94.5%) and random forest perform better than the sequential minimal optimization (SMO) and multilayer perceptron (MLP) methods.
Abstract: Data mining is defined as a search through large amounts of data for valuable information. Association rules, grouping, clustering, prediction, and sequence modeling are some of the essential and most general strategies for data extraction. Data processing plays a major role in disease detection in the healthcare industry. Diagnosing a patient normally requires a variety of examinations; with data mining strategies, the number of examinations can be reduced, which matters for both time and results. Heart disease is a life-threatening disorder. Currently, health problems are immense because of the prevalence of health issues and the combination of various conditions. Today, hidden information is important for decision making in the healthcare industry. For the prediction of cardiovascular problems, this analysis uses the Weka 3.8.3 tool with data mining algorithms such as sequential minimal optimization (SMO), multilayer perceptron (MLP), random forest, and Bayes net. The collected results combine prediction accuracy, the receiver operating characteristic (ROC) curve, and the PRC value. Bayes net (94.5%) and random forest (94%) perform better than the sequential minimal optimization (SMO) and multilayer perceptron (MLP) methods.

7 citations


Cites background from "Empirical Analysis of Classificatio..."

  • ...A strong system has a very high absolute mean error [30]....

Journal ArticleDOI
TL;DR: A hyper model is created to optimize the VFDT algorithm, reducing wasted energy while maintaining accuracy; the algorithm's performance improved noticeably in terms of reduced energy consumption at maintained accuracy levels.
Abstract: Traditional machine learning (ML) algorithms use static datasets to model knowledge. Nowadays, there is an increasing demand for machine learning based solutions that can handle huge amounts of data in the form of never-ending streams. The Very Fast Decision Tree (VFDT) is one of the most widely used data stream mining (DSM) algorithms, despite the fact that it wastes a huge amount of energy on trivial calculations. The machine learning community has prioritized accuracy and execution time when designing algorithms like this. When assessing data mining algorithms, numerous types of studies include energy usage as a crucial factor. The purpose of this research is to create a hyper model to optimize the VFDT algorithm, reducing wasted energy while maintaining accuracy. In the proposed method, some fixed algorithm parameters were changed to dynamic parameters after each was analyzed separately to determine how much it helps reduce energy consumption in several cases of the algorithm. The practical experiment was conducted on both the basic form and the proposed form of the algorithm, on several different types of datasets, in the same application environment. The main advantage of the proposed method over the basic algorithm is a noticeable improvement in performance: energy consumption is reduced while accuracy levels are maintained.
Journal ArticleDOI
TL;DR: In this paper, the authors propose a mechanism for improving the algorithm's energy usage and restricting computational resources without compromising accuracy and execution time, and show that the proposed algorithm works considerably better and faster with less energy while maintaining accuracy.
Abstract: Traditional machine learning (ML) techniques model knowledge using static datasets. With the increased use of the Internet in today's digital world, a massive amount of data is generated at an accelerated rate. This data must be handled as soon as it arrives, because it is continuous and cannot be kept for a long period of time. Various methods exist for mining data from streams. When developing such methods, the machine learning community has put accuracy and execution time first. Numerous sorts of studies take energy consumption into consideration while evaluating data mining methods. This work concentrates on the Very Fast Decision Tree, which is the most often used technique in data flow classification, despite the fact that it wastes a huge amount of energy on trivial calculations. The research presents a mechanism for improving the algorithm's energy usage and restricting computational resources without compromising the algorithm's efficiency. The mechanism has two stages: the first eliminates a set of bad features that increase computational complexity and waste energy, and the second groups the good features into a candidate group that is used in the next iteration instead of all of the attributes. Experiments were conducted on real-world benchmark and synthetic datasets to compare the proposed method to state-of-the-art algorithms from previous works. The proposed algorithm works considerably better and faster with less energy while maintaining accuracy.
Keywords: Classification; energy consumption; Hoeffding bound; Information gain; massive online analysis; stream data; very fast decision tree
References
Book
08 Sep 2000
TL;DR: This book presents dozens of algorithms and implementation examples, all in pseudo-code and suitable for use in real-world, large-scale data mining projects, and provides a comprehensive, practical look at the concepts and techniques you need to get the most out of real business data.
Abstract: The increasing volume of data in modern business and science calls for more complex and sophisticated tools. Although advances in data mining technology have made extensive data collection much easier, the field is still evolving, and there is a constant need for new techniques and tools that can help us transform this data into useful information and knowledge. Since the previous edition's publication, great advances have been made in the field of data mining. Not only does the third edition of Data Mining: Concepts and Techniques continue the tradition of equipping you with an understanding and application of the theory and practice of discovering patterns hidden in large data sets, it also focuses on new, important topics in the field: data warehouses and data cube technology, mining stream, mining social networks, and mining spatial, multimedia and other complex data. Each chapter is a stand-alone guide to a critical topic, presenting proven algorithms and sound implementations ready to be used directly or with strategic modification against live data. This is the resource you need if you want to apply today's most powerful data mining techniques to meet real business challenges. * Presents dozens of algorithms and implementation examples, all in pseudo-code and suitable for use in real-world, large-scale data mining projects. * Addresses advanced topics such as mining object-relational databases, spatial databases, multimedia databases, time-series databases, text databases, the World Wide Web, and applications in several fields. * Provides a comprehensive, practical look at the concepts and techniques you need to get the most out of real business data.

23,600 citations

Journal Article
TL;DR: MOA includes a collection of offline and online methods as well as tools for evaluation that implements boosting, bagging, and Hoeffding Trees, all with and without Naive Bayes classifiers at the leaves.
Abstract: Massive Online Analysis (MOA) is a software environment for implementing algorithms and running experiments for online learning from evolving data streams. MOA includes a collection of offline and online methods as well as tools for evaluation. In particular, it implements boosting, bagging, and Hoeffding Trees, all with and without Naive Bayes classifiers at the leaves. MOA supports bi-directional interaction with WEKA, the Waikato Environment for Knowledge Analysis, and is released under the GNU GPL license.

1,373 citations
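
The Hoeffding Trees named in the MOA abstract above rely on the Hoeffding bound to decide when enough stream examples have been seen to split a leaf. The sketch below is in Python rather than MOA's Java and is not MOA's API; the function names and the example numbers are illustrative assumptions. It uses the standard bound: after n observations of a quantity with range R, the observed mean is within epsilon = sqrt(R^2 ln(1/delta) / (2n)) of the true mean with probability 1 - delta.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Return epsilon for n observations of a variable with the given range.

    With probability 1 - delta, the true mean lies within epsilon of the
    mean observed over the n samples.
    """
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, value_range, delta, n):
    """Split when the best attribute leads the runner-up by more than epsilon.

    best_gain and second_gain are the split-criterion scores (for example,
    information gain) of the two top attributes at a leaf. This helper is a
    hypothetical illustration of the decision rule, not a library function.
    """
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


if __name__ == "__main__":
    # Epsilon shrinks as more examples arrive at the leaf.
    for n in (100, 1_000, 10_000):
        print(n, hoeffding_bound(1.0, 1e-7, n))
```

Because epsilon shrinks as n grows, a leaf that cannot separate its two best attributes early on may still split later, once more examples have arrived.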

Proceedings ArticleDOI
26 Aug 2001
TL;DR: A fast algorithm for large-scale or streaming data that classifies as well as a single decision tree built on all the data, requires approximately constant memory, and adjusts quickly to concept drift is presented.
Abstract: Ensemble methods have recently garnered a great deal of attention in the machine learning community. Techniques such as Boosting and Bagging have proven to be highly effective but require repeated resampling of the training data, making them inappropriate in a data mining context. The methods presented in this paper take advantage of plentiful data, building separate classifiers on sequential chunks of training points. These classifiers are combined into a fixed-size ensemble using a heuristic replacement strategy. The result is a fast algorithm for large-scale or streaming data that classifies as well as a single decision tree built on all the data, requires approximately constant memory, and adjusts quickly to concept drift.

1,184 citations
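
The chunk-based idea in the abstract above (separate classifiers built on sequential chunks and combined into a fixed-size ensemble) can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the nearest-centroid base learner, the class names, and the replacement rule (swap out the member that scores worst on the newest chunk) are all assumptions chosen for brevity.

```python
from collections import Counter


class CentroidClassifier:
    """Stand-in base learner (assumption): predicts the class with the nearest mean."""

    def fit(self, X, y):
        sums, counts = {}, Counter(y)
        for xi, yi in zip(X, y):
            acc = sums.setdefault(yi, [0.0] * len(xi))
            for j, v in enumerate(xi):
                acc[j] += v
        self.centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}
        return self

    def predict_one(self, x):
        def dist(c):
            return sum((a - b) ** 2 for a, b in zip(x, self.centroids[c]))
        return min(self.centroids, key=dist)


def accuracy(model, X, y):
    return sum(model.predict_one(xi) == yi for xi, yi in zip(X, y)) / len(y)


class ChunkEnsemble:
    """Fixed-size ensemble that trains one classifier per chunk of the stream."""

    def __init__(self, max_size=3):
        self.max_size = max_size
        self.members = []

    def update(self, X, y):
        candidate = CentroidClassifier().fit(X, y)
        if len(self.members) < self.max_size:
            self.members.append(candidate)
            return
        # Simplified heuristic (assumption): score every member on the newest
        # chunk and replace the weakest one if the candidate does better.
        # The candidate is scored on its own training chunk, which is
        # optimistic; a held-out chunk would be a fairer comparison.
        scores = [accuracy(m, X, y) for m in self.members]
        worst = min(range(len(scores)), key=scores.__getitem__)
        if accuracy(candidate, X, y) > scores[worst]:
            self.members[worst] = candidate

    def predict_one(self, x):
        votes = Counter(m.predict_one(x) for m in self.members)
        return votes.most_common(1)[0][0]
```

Memory stays roughly constant because only `max_size` classifiers are kept, and old members are voted out chunk by chunk once the data no longer matches what they learned.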

Proceedings Article
04 Jan 2001
TL;DR: This paper presents online versions of bagging and boosting that require only one pass through the training data and compares the online and batch algorithms experimentally in terms of accuracy and running time.

818 citations

Proceedings ArticleDOI
01 Jan 2005
TL;DR: In this paper, the authors present online versions of bagging and boosting that require only one pass through the training data, and compare the online and batch algorithms experimentally in terms of accuracy and running time.
Abstract: Bagging and boosting are two of the most well-known ensemble learning methods due to their theoretical performance guarantees and strong experimental results. However, these algorithms have been used mainly in batch mode, i.e., they require the entire training set to be available at once and, in some cases, require random access to the data. In this paper, we present online versions of bagging and boosting that require only one pass through the training data. We build on previously presented work by describing some theoretical results. We also compare the online and batch algorithms experimentally in terms of accuracy and running time.

502 citations
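
A one-pass version of bagging, as described in the abstract above, is usually realized by showing each arriving example to every base model k times, with k drawn from a Poisson(1) distribution; this stands in for the bootstrap resampling of batch bagging. The sketch below assumes that formulation. The running-mean base learner, the class names, and the seed are illustrative assumptions, not taken from the paper.

```python
import math
import random
from collections import Counter


def poisson1(rng):
    """Draw from Poisson(lambda=1) with Knuth's multiplication method."""
    threshold = math.exp(-1.0)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


class RunningMeanClassifier:
    """Stand-in incremental learner (assumption): nearest running class mean."""

    def __init__(self):
        self.sums = {}
        self.counts = Counter()

    def learn_one(self, x, y):
        acc = self.sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        self.counts[y] += 1

    def predict_one(self, x):
        if not self.counts:
            return None  # nothing learned yet

        def dist(c):
            n = self.counts[c]
            return sum((a - s / n) ** 2 for a, s in zip(x, self.sums[c]))
        return min(self.counts, key=dist)


class OnlineBagging:
    """Each example is seen once; every model trains on it Poisson(1) times."""

    def __init__(self, n_models=10, seed=0):
        self.rng = random.Random(seed)
        self.models = [RunningMeanClassifier() for _ in range(n_models)]

    def learn_one(self, x, y):
        for model in self.models:
            for _ in range(poisson1(self.rng)):
                model.learn_one(x, y)

    def predict_one(self, x):
        preds = (m.predict_one(x) for m in self.models)
        votes = Counter(p for p in preds if p is not None)
        return votes.most_common(1)[0][0] if votes else None
```

Calling `learn_one` once per arriving example and `predict_one` whenever a label is needed keeps the whole procedure to a single pass over the stream.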