Journal ArticleDOI

Survey on Classification Techniques for Data Mining

17 Dec 2015-International Journal of Computer Applications (Foundation of Computer Science (FCS), NY, USA)-Vol. 132, Iss: 4, pp 13-16
TL;DR: This survey compares the classification methods and classifiers that can be used to categorize initially unlabeled observations, demonstrating the accuracy and usefulness of each classifier and the circumstances in which it should be applied.
Abstract: This paper focuses on the various techniques that can be implemented for classification of observations that are initially uncategorized. Our objective is to compare the different classification methods and classifiers that can be used for this purpose. In this paper, we study and demonstrate the different accuracies and usefulness of classifiers and the circumstances in which they should be implemented. General Terms Classification, Sentiment, Review, Accuracy, Positive, Negative, Neutral.


Citations
Journal ArticleDOI
TL;DR: This survey takes an interdisciplinary approach to cover studies related to CatBoost in a single work, and provides researchers an in-depth understanding to help clarify proper application of CatBoost in solving problems.
Abstract: Gradient Boosted Decision Trees (GBDT’s) are a powerful tool for classification and regression tasks in Big Data. Researchers should be familiar with the strengths and weaknesses of current implementations of GBDT’s in order to use them effectively and make successful contributions. CatBoost is a member of the family of GBDT machine learning ensemble techniques. Since its debut in late 2018, researchers have successfully used CatBoost for machine learning studies involving Big Data. We take this opportunity to review recent research on CatBoost as it relates to Big Data, and learn best practices from studies that cast CatBoost in a positive light, as well as studies where CatBoost does not outshine other techniques, since we can learn lessons from both types of scenarios. Furthermore, as a Decision Tree based algorithm, CatBoost is well-suited to machine learning tasks involving categorical, heterogeneous data. Recent work across multiple disciplines illustrates CatBoost’s effectiveness and shortcomings in classification and regression tasks. Another important issue we expose in literature on CatBoost is its sensitivity to hyper-parameters and the importance of hyper-parameter tuning. One contribution we make is to take an interdisciplinary approach to cover studies related to CatBoost in a single work. This provides researchers an in-depth understanding to help clarify proper application of CatBoost in solving problems. To the best of our knowledge, this is the first survey that studies all works related to CatBoost in a single publication.
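The hyper-parameter sensitivity this survey highlights is easy to demonstrate with any GBDT library. The sketch below uses scikit-learn's GradientBoostingClassifier as a generic stand-in for CatBoost (whose own API, CatBoostClassifier with a cat_features argument, additionally offers ordered boosting and native categorical-feature handling); the parameter values are illustrative assumptions, not tuned settings.

```python
# GBDT sketch: two hyper-parameter settings for tree depth and learning
# rate, illustrating the tuning sensitivity the survey emphasizes.
# GradientBoostingClassifier stands in for CatBoost here; CatBoost's own
# API differs, notably in its native categorical-feature support.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

accs = {}
for depth, lr in [(2, 0.05), (4, 0.2)]:
    model = GradientBoostingClassifier(max_depth=depth, learning_rate=lr,
                                       n_estimators=100, random_state=0)
    accs[(depth, lr)] = model.fit(X_tr, y_tr).score(X_te, y_te)
```

Even on synthetic data, swapping between two such settings can shift test accuracy noticeably, which is why the survey stresses hyper-parameter tuning.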

247 citations


Cites methods from "Survey on Classification Techniques..."

  • ...Sujatha M, Prabhakar S, Lavanya GD....


  • ...Another related work is “A survey of classification techniques in data mining” by Sujatha and Prabhakar [43]....


  • ...Sujatha and Prabhakar published this study in 2013, prior to the release of CatBoost....


05 Feb 2006
TL;DR: A preprocessing method is proposed to relax the conditional-independence assumption of Naive Bayes classifiers, and experiments show that it clearly improves classification accuracy.
Abstract: In the Naive Bayes Classifiers model, the child nodes (feature variables) under a parent node are required to be relatively independent of one another; in the real world, however, features are non-independent and correlated. A preprocessing method is proposed, and experimental results show that it clearly improves classification accuracy.
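For context, Naive Bayes scores a class as the prior times a product of per-feature likelihoods, which is exactly the conditional-independence assumption this paper's preprocessing targets. A minimal numeric sketch, with toy probabilities that are purely illustrative:

```python
# Naive Bayes in miniature: class score = prior * product of per-feature
# likelihoods, i.e. features are assumed conditionally independent given
# the class. All numbers below are toy values, not from the paper.
priors = {"pos": 0.5, "neg": 0.5}
# P(feature_i = 1 | class), one entry per binary feature
likelihoods = {
    "pos": [0.8, 0.6, 0.1],
    "neg": [0.2, 0.3, 0.7],
}

def nb_score(x, c):
    score = priors[c]
    for xi, p in zip(x, likelihoods[c]):
        score *= p if xi == 1 else (1 - p)
    return score

x = [1, 1, 0]  # observed feature vector
scores = {c: nb_score(x, c) for c in priors}
predicted = max(scores, key=scores.get)
```

When features are actually correlated, the product over-counts shared evidence; that is the flaw the paper's preprocessing step is designed to mitigate.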

213 citations

Journal ArticleDOI
TL;DR: This research work uses C5.0 as the base classifier, so the proposed system classifies the result set with high accuracy and low memory usage; the decision tree's overfitting problem is solved with the reduced-error pruning technique.
Abstract: Data mining is a knowledge discovery process that analyzes data and generates useful patterns from it. Classification is a technique that uses pre-classified examples to classify new instances, and decision trees are used to model the classification process: using the feature values of instances, decision trees classify those instances, with each node in the tree representing a feature of the instance to be classified. In this research work, ID3, C4.5 and C5.0 are compared with each other; among these classifiers, C5.0 gives the most accurate and efficient results. This research work therefore uses C5.0 as the base classifier, so the proposed system classifies the result set with high accuracy and low memory usage. The classification process generates fewer rules compared with other techniques, so the proposed system has low memory usage; the error rate is low, so accuracy on the result set is high; and a pruned tree is generated, so the system produces results faster than other techniques. The proposed system's C5.0 classifier performs feature selection and reduced-error pruning, both described in this paper. The feature selection technique assumes that the data contains many redundant features: it removes features that provide no useful information in any context and selects the relevant features that are useful in model construction. Cross-validation gives a more reliable estimate of predictive accuracy, and the overfitting problem of the decision tree is solved by the reduced-error pruning technique. The proposed system achieves a 1 to 3% gain in accuracy, a reduced error rate, and a decision tree constructed in less time.
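C5.0 itself is proprietary, so as a rough analogue of the pruning step described above, the sketch below uses scikit-learn's DecisionTreeClassifier with cost-complexity pruning (a different pruning criterion than reduced-error pruning, but with the same effect of shrinking the tree). The dataset and the ccp_alpha value are illustrative assumptions.

```python
# Pruning sketch: a pruned tree has fewer nodes, which is the memory and
# rule-count saving the paper claims, while keeping accuracy high.
# Cost-complexity pruning (ccp_alpha) stands in for C5.0's
# reduced-error pruning, which scikit-learn does not implement.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
full = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)

n_full, n_pruned = full.tree_.node_count, pruned.tree_.node_count
pruned_acc = pruned.score(X, y)
```

The pruned tree trades a little training accuracy for far fewer nodes, mirroring the accuracy/memory trade-off the paper reports.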

170 citations

01 Jan 2013
TL;DR: Presenting the classifier's output under various distance measures may help characterize its response for the desired application; the paper also covers computational issues in identifying nearest neighbors and mechanisms for reducing the dimension of the data.
Abstract: Whether the task of classifying data lies in the field of neural networks or in a biometrics application such as handwriting classification or iris detection, possibly the most straightforward classifier in the arsenal of machine learning techniques is the nearest-neighbor classifier, in which classification is achieved by identifying the nearest neighbors to a query example and using those neighbors to determine the class of the query. K-NN classification classifies instances based on their similarity to instances in the training data. This paper presents the classifier's output under various distance measures, which may help characterize its response for the desired application; it also covers computational issues in identifying nearest neighbors and mechanisms for reducing the dimension of the data.
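The paper's central point, that the choice of distance measure changes the classifier's response, can be sketched by fitting k-NN under two metrics. The dataset and the value of k are illustrative assumptions, not the paper's setup.

```python
# k-NN sketch: classification by majority vote among the k nearest
# training points, under two different distance metrics. The metric
# changes which points count as "nearest", and hence the decisions.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
results = {}
for metric in ("euclidean", "manhattan"):
    knn = KNeighborsClassifier(n_neighbors=5, metric=metric).fit(X, y)
    results[metric] = knn.score(X, y)
```

Comparing the per-metric scores (and, on harder data, the per-query predictions) is the kind of response analysis the paper reports.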

107 citations


Cites background from "Survey on Classification Techniques..."

  • ...Consequently, data mining consists of more than collection and managing data, it also includes analysis and prediction [8]....


Proceedings ArticleDOI
01 Dec 2015
TL;DR: This work presents a system, which uses data mining techniques in order to predict the category of the analyzed soil datasets, which will indicate the yielding of crops.
Abstract: Yield prediction is very popular among farmers these days, which particularly contributes to the proper selection of crops for sowing. This makes the problem of predicting the yielding of crops an interesting challenge. Earlier yield prediction was performed by considering the farmer's experience on a particular field and crop. This work presents a system, which uses data mining techniques in order to predict the category of the analyzed soil datasets. The category, thus predicted will indicate the yielding of crops. The problem of predicting the crop yield is formalized as a classification rule, where Naive Bayes and K-Nearest Neighbor methods are used.
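A sketch of this setup: treat yield prediction as classification over soil samples and compare the two classifiers the paper names. The feature names (soil pH, moisture, nitrogen) and the synthetic data are hypothetical stand-ins, not the paper's dataset.

```python
# Crop-yield classification sketch: Naive Bayes vs. k-NN on synthetic
# soil samples. Feature names and class means are assumed for
# illustration only.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# columns: soil pH, moisture %, nitrogen level (assumed features)
low  = rng.normal([5.0, 20.0, 1.0], 0.5, size=(50, 3))
high = rng.normal([6.5, 35.0, 3.0], 0.5, size=(50, 3))
X = np.vstack([low, high])
y = np.array(["low-yield"] * 50 + ["high-yield"] * 50)

nb_acc  = GaussianNB().fit(X, y).score(X, y)
knn_acc = KNeighborsClassifier(n_neighbors=3).fit(X, y).score(X, y)
```

On real soil data the two classifiers would diverge more; comparing them on the same dataset, as here, is the evaluation the paper formalizes.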

86 citations

References
Book
08 Sep 2000
TL;DR: This book presents dozens of algorithms and implementation examples, all in pseudo-code and suitable for use in real-world, large-scale data mining projects, and provides a comprehensive, practical look at the concepts and techniques you need to get the most out of real business data.
Abstract: The increasing volume of data in modern business and science calls for more complex and sophisticated tools. Although advances in data mining technology have made extensive data collection much easier, the field is still evolving and there is a constant need for new techniques and tools that can help us transform this data into useful information and knowledge. Since the previous edition's publication, great advances have been made in the field of data mining. Not only does the third edition of Data Mining: Concepts and Techniques continue the tradition of equipping you with an understanding and application of the theory and practice of discovering patterns hidden in large data sets, it also focuses on new, important topics in the field: data warehouses and data cube technology, mining streams, mining social networks, and mining spatial, multimedia and other complex data. Each chapter is a stand-alone guide to a critical topic, presenting proven algorithms and sound implementations ready to be used directly or with strategic modification against live data. This is the resource you need if you want to apply today's most powerful data mining techniques to meet real business challenges.

  • Presents dozens of algorithms and implementation examples, all in pseudo-code and suitable for use in real-world, large-scale data mining projects.

  • Addresses advanced topics such as mining object-relational databases, spatial databases, multimedia databases, time-series databases, text databases, the World Wide Web, and applications in several fields.

  • Provides a comprehensive, practical look at the concepts and techniques you need to get the most out of real business data.

23,600 citations

Book
12 Aug 2008
TL;DR: This book explains the principles that make support vector machines (SVMs) a successful modelling and prediction tool for a variety of applications and provides a unique in-depth treatment of both fundamental and recent material on SVMs that so far has been scattered in the literature.
Abstract: This book explains the principles that make support vector machines (SVMs) a successful modelling and prediction tool for a variety of applications. The authors present the basic ideas of SVMs together with the latest developments and current research questions in a unified style. They identify three reasons for the success of SVMs: their ability to learn well with only a very small number of free parameters, their robustness against several types of model violations and outliers, and their computational efficiency compared to several other methods. Since their appearance in the early nineties, support vector machines and related kernel-based methods have been successfully applied in diverse fields of application such as bioinformatics, fraud detection, construction of insurance tariffs, direct marketing, and data and text mining. As a consequence, SVMs now play an important role in statistical machine learning and are used not only by statisticians, mathematicians, and computer scientists, but also by engineers and data analysts. The book provides a unique in-depth treatment of both fundamental and recent material on SVMs that so far has been scattered in the literature. The book can thus serve as both a basis for graduate courses and an introduction for statisticians, mathematicians, and computer scientists. It further provides a valuable reference for researchers working in the field. The book covers all important topics concerning support vector machines such as: loss functions and their role in the learning process; reproducing kernel Hilbert spaces and their properties; a thorough statistical analysis that uses both traditional uniform bounds and more advanced localized techniques based on Rademacher averages and Talagrand's inequality; a detailed treatment of classification and regression; a detailed robustness analysis; and a description of some of the most recent implementation techniques. 
To make the book self-contained, an extensive appendix is added which provides the reader with the necessary background from statistics, probability theory, functional analysis, convex analysis, and topology.
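As a concrete illustration of the "very small number of free parameters" point above, the sketch below fits a linear and an RBF-kernel SVM; the dataset and the parameter values (C, gamma) are illustrative assumptions, not recommendations from the book.

```python
# SVM sketch: a maximum-margin classifier whose flexibility comes from
# the kernel plus a handful of free parameters (here C and gamma).
# On data that is not linearly separable, the kernel choice dominates.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)
linear = SVC(kernel="linear", C=1.0).fit(X, y)
rbf    = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

lin_acc, rbf_acc = linear.score(X, y), rbf.score(X, y)
```

The RBF kernel separates the interleaved half-moons that defeat the linear model, with only C and gamma to tune, which is the robustness-per-parameter argument the book develops formally.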

4,664 citations

Journal ArticleDOI
TL;DR: This issue's collection of essays should help familiarize readers with this interesting new racehorse in the Machine Learning stable, and give a practical guide and a new technique for implementing the algorithm efficiently.
Abstract: My first exposure to Support Vector Machines came this spring when I heard Sue Dumais present impressive results on text categorization using this analysis technique. This issue's collection of essays should help familiarize our readers with this interesting new racehorse in the Machine Learning stable. Bernhard Scholkopf, in an introductory overview, points out that a particular advantage of SVMs over other learning algorithms is that they can be analyzed theoretically using concepts from computational learning theory, and at the same time can achieve good performance when applied to real problems. Examples of these real-world applications are provided by Sue Dumais, who describes the aforementioned text-categorization problem, yielding the best results to date on the Reuters collection, and Edgar Osuna, who presents strong results on application to face detection. Our fourth author, John Platt, gives us a practical guide and a new technique for implementing the algorithm efficiently.

4,319 citations

Book
01 Jan 2005
TL;DR: This book first surveys, then provides comprehensive yet concise algorithmic descriptions of methods, including classic methods plus the extensions and novel methods developed recently.
Abstract: This book organizes key concepts, theories, standards, methodologies, trends, challenges and applications of data mining and knowledge discovery in databases. It first surveys, then provides comprehensive yet concise algorithmic descriptions of methods, including classic methods plus the extensions and novel methods developed recently. It also gives in-depth descriptions of data mining applications in various interdisciplinary industries.

2,836 citations

Proceedings Article
01 May 2010
TL;DR: This paper shows how to automatically collect a corpus for sentiment analysis and opinion mining purposes and builds a sentiment classifier, that is able to determine positive, negative and neutral sentiments for a document.
Abstract: Microblogging today has become a very popular communication tool among Internet users. Millions of users share opinions on different aspects of life every day. Microblogging web-sites are therefore rich sources of data for opinion mining and sentiment analysis. Because microblogging appeared relatively recently, only a few research works have been devoted to this topic. In our paper, we focus on using Twitter, the most popular microblogging platform, for the task of sentiment analysis. We show how to automatically collect a corpus for sentiment analysis and opinion mining purposes. We perform linguistic analysis of the collected corpus and explain the discovered phenomena. Using the corpus, we build a sentiment classifier that is able to determine positive, negative and neutral sentiments for a document. Experimental evaluations show that our proposed techniques are efficient and perform better than previously proposed methods. In our research we worked with English; however, the proposed technique can be used with any other language.
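The pipeline this abstract describes, training on a labeled corpus and then predicting positive/negative/neutral for a document, can be sketched with a bag-of-words model. The six-document corpus below is a toy stand-in for the paper's automatically collected Twitter data, and multinomial Naive Bayes is one plausible classifier choice, not necessarily the paper's.

```python
# Three-class sentiment sketch: bag-of-words features feeding a
# multinomial Naive Bayes classifier. The corpus is a toy substitute
# for an automatically collected microblogging corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = [
    "great product love it", "awesome and wonderful experience",
    "terrible awful waste", "hate it worst purchase",
    "it arrived on tuesday", "the box contains a manual",
]
labels = ["positive", "positive", "negative", "negative",
          "neutral", "neutral"]

clf = make_pipeline(CountVectorizer(), MultinomialNB()).fit(docs, labels)
pred = clf.predict(["love this wonderful product"])[0]
```

At realistic scale the corpus collection step (e.g. using emoticons as noisy labels, as the paper does) matters far more than the classifier choice.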

2,570 citations