Data mining: an overview from a database perspective
01 Dec 1996-IEEE Transactions on Knowledge and Data Engineering (IEEE Computer Society)-Vol. 8, Iss: 6, pp 866-883
TL;DR: In this paper, a survey of the available data mining techniques is provided and a comparative study of such techniques is presented, based on a database researcher's point-of-view.
Abstract: Mining information and knowledge from large databases has been recognized by many researchers as a key research topic in database systems and machine learning, and by many industrial companies as an important area with an opportunity of major revenues. Researchers in many different fields have shown great interest in data mining. Several emerging applications in information-providing services, such as data warehousing and online services over the Internet, also call for various data mining techniques to better understand user behavior, to improve the service provided and to increase business opportunities. In response to such a demand, this article provides a survey, from a database researcher's point of view, on the data mining techniques developed recently. A classification of the available data mining techniques is provided and a comparative study of such techniques is presented.
Citations
More filters
•
08 Sep 2000TL;DR: This book presents dozens of algorithms and implementation examples, all in pseudo-code and suitable for use in real-world, large-scale data mining projects, and provides a comprehensive, practical look at the concepts and techniques you need to get the most out of real business data.
Abstract: The increasing volume of data in modern business and science calls for more complex and sophisticated tools. Although advances in data mining technology have made extensive data collection much easier, it's still always evolving and there is a constant need for new techniques and tools that can help us transform this data into useful information and knowledge. Since the previous edition's publication, great advances have been made in the field of data mining. Not only does the third of edition of Data Mining: Concepts and Techniques continue the tradition of equipping you with an understanding and application of the theory and practice of discovering patterns hidden in large data sets, it also focuses on new, important topics in the field: data warehouses and data cube technology, mining stream, mining social networks, and mining spatial, multimedia and other complex data. Each chapter is a stand-alone guide to a critical topic, presenting proven algorithms and sound implementations ready to be used directly or with strategic modification against live data. This is the resource you need if you want to apply today's most powerful data mining techniques to meet real business challenges. * Presents dozens of algorithms and implementation examples, all in pseudo-code and suitable for use in real-world, large-scale data mining projects. * Addresses advanced topics such as mining object-relational databases, spatial databases, multimedia databases, time-series databases, text databases, the World Wide Web, and applications in several fields. *Provides a comprehensive, practical look at the concepts and techniques you need to get the most out of real business data
23,600 citations
•
25 Oct 1999
TL;DR: This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining.
Abstract: Data Mining: Practical Machine Learning Tools and Techniques offers a thorough grounding in machine learning concepts as well as practical advice on applying machine learning tools and techniques in real-world data mining situations. This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining. Thorough updates reflect the technical changes and modernizations that have taken place in the field since the last edition, including new material on Data Transformations, Ensemble Learning, Massive Data Sets, Multi-instance Learning, plus a new version of the popular Weka machine learning software developed by the authors. Witten, Frank, and Hall include both tried-and-true techniques of today as well as methods at the leading edge of contemporary research. *Provides a thorough grounding in machine learning concepts as well as practical advice on applying the tools and techniques to your data mining projects *Offers concrete tips and techniques for performance improvement that work by transforming the input or output in machine learning methods *Includes downloadable Weka software toolkit, a collection of machine learning algorithms for data mining tasks-in an updated, interactive interface. Algorithms in toolkit cover: data pre-processing, classification, regression, clustering, association rules, visualization
20,196 citations
•
01 Jan 2013
TL;DR: This book discusses data mining through the lens of cluster analysis, which examines the relationships between data, clusters, and algorithms, and some of the techniques used to solve these problems.
Abstract: 1 Introduction 1.1 What is Data Mining? 1.2 Motivating Challenges 1.3 The Origins of Data Mining 1.4 Data Mining Tasks 1.5 Scope and Organization of the Book 1.6 Bibliographic Notes 1.7 Exercises 2 Data 2.1 Types of Data 2.2 Data Quality 2.3 Data Preprocessing 2.4 Measures of Similarity and Dissimilarity 2.5 Bibliographic Notes 2.6 Exercises 3 Exploring Data 3.1 The Iris Data Set 3.2 Summary Statistics 3.3 Visualization 3.4 OLAP and Multidimensional Data Analysis 3.5 Bibliographic Notes 3.6 Exercises 4 Classification: Basic Concepts, Decision Trees, and Model Evaluation 4.1 Preliminaries 4.2 General Approach to Solving a Classification Problem 4.3 Decision Tree Induction 4.4 Model Overfitting 4.5 Evaluating the Performance of a Classifier 4.6 Methods for Comparing Classifiers 4.7 Bibliographic Notes 4.8 Exercises 5 Classification: Alternative Techniques 5.1 Rule-Based Classifier 5.2 Nearest-Neighbor Classifiers 5.3 Bayesian Classifiers 5.4 Artificial Neural Network (ANN) 5.5 Support Vector Machine (SVM) 5.6 Ensemble Methods 5.7 Class Imbalance Problem 5.8 Multiclass Problem 5.9 Bibliographic Notes 5.10 Exercises 6 Association Analysis: Basic Concepts and Algorithms 6.1 Problem Definition 6.2 Frequent Itemset Generation 6.3 Rule Generation 6.4 Compact Representation of Frequent Itemsets 6.5 Alternative Methods for Generating Frequent Itemsets 6.6 FP-Growth Algorithm 6.7 Evaluation of Association Patterns 6.8 Effect of Skewed Support Distribution 6.9 Bibliographic Notes 6.10 Exercises 7 Association Analysis: Advanced Concepts 7.1 Handling Categorical Attributes 7.2 Handling Continuous Attributes 7.3 Handling a Concept Hierarchy 7.4 Sequential Patterns 7.5 Subgraph Patterns 7.6 Infrequent Patterns 7.7 Bibliographic Notes 7.8 Exercises 8 Cluster Analysis: Basic Concepts and Algorithms 8.1 Overview 8.2 K-means 8.3 Agglomerative Hierarchical Clustering 8.4 DBSCAN 8.5 Cluster Evaluation 8.6 Bibliographic Notes 8.7 Exercises 9 Cluster Analysis: Additional Issues and Algorithms 9.1 Characteristics of Data, Clusters, and Clustering Algorithms 9.2 Prototype-Based Clustering 9.3 Density-Based Clustering 9.4 Graph-Based Clustering 9.5 Scalable Clustering Algorithms 9.6 Which Clustering Algorithm? 9.7 Bibliographic Notes 9.8 Exercises 10 Anomaly Detection 10.1 Preliminaries 10.2 Statistical Approaches 10.3 Proximity-Based Outlier Detection 10.4 Density-Based Outlier Detection 10.5 Clustering-Based Techniques 10.6 Bibliographic Notes 10.7 Exercises Appendix A Linear Algebra Appendix B Dimensionality Reduction Appendix C Probability and Statistics Appendix D Regression Appendix E Optimization Author Index Subject Index
7,356 citations
••
01 Jan 2008TL;DR: A theoretical framework describing the trust-based decision-making process a consumer uses when making a purchase from a given site is developed and the proposed model is tested using a Structural Equation Modeling technique on Internet consumer purchasing behavior data collected via a Web survey.
Abstract: Are trust and risk important in consumers' electronic commerce purchasing decisions? What are the antecedents of trust and risk in this context? How do trust and risk affect an Internet consumer's purchasing decision? To answer these questions, we i) develop a theoretical framework describing the trust-based decision-making process a consumer uses when making a purchase from a given site, ii) test the proposed model using a Structural Equation Modeling technique on Internet consumer purchasing behavior data collected via a Web survey, and iii) consider the implications of the model. The results of the study show that Internet consumers' trust and perceived risk have strong impacts on their purchasing decisions. Consumer disposition to trust, reputation, privacy concerns, security concerns, the information quality of the Website, and the company's reputation, have strong effects on Internet consumers' trust in the Website. Interestingly, the presence of a third-party seal did not strongly influence consumers' trust.
2,650 citations
01 Jan 2006
TL;DR: There have been many data mining books published in recent years, including Predictive Data Mining by Weiss and Indurkhya [WI98], Data Mining Solutions: Methods and Tools for Solving Real-World Problems by Westphal and Blaxton [WB98], Mastering Data Mining: The Art and Science of Customer Relationship Management by Berry and Linofi [BL99].
Abstract: The book Knowledge Discovery in Databases, edited by Piatetsky-Shapiro and Frawley [PSF91], is an early collection of research papers on knowledge discovery from data. The book Advances in Knowledge Discovery and Data Mining, edited by Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy [FPSSe96], is a collection of later research results on knowledge discovery and data mining. There have been many data mining books published in recent years, including Predictive Data Mining by Weiss and Indurkhya [WI98], Data Mining Solutions: Methods and Tools for Solving Real-World Problems by Westphal and Blaxton [WB98], Mastering Data Mining: The Art and Science of Customer Relationship Management by Berry and Linofi [BL99], Building Data Mining Applications for CRM by Berson, Smith, and Thearling [BST99], Data Mining: Practical Machine Learning Tools and Techniques by Witten and Frank [WF05], Principles of Data Mining (Adaptive Computation and Machine Learning) by Hand, Mannila, and Smyth [HMS01], The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman [HTF01], Data Mining: Introductory and Advanced Topics by Dunham, and Data Mining: Multimedia, Soft Computing, and Bioinformatics by Mitra and Acharya [MA03]. There are also books containing collections of papers on particular aspects of knowledge discovery, such as Machine Learning and Data Mining: Methods and Applications edited by Michalski, Brakto, and Kubat [MBK98], and Relational Data Mining edited by Dzeroski and Lavrac [De01], as well as many tutorial notes on data mining in major database, data mining and machine learning conferences.
2,591 citations
References
More filters
•
15 Oct 1992TL;DR: A complete guide to the C4.5 system as implemented in C for the UNIX environment, which starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and over hitting.
Abstract: From the Publisher:
Classifier systems play a major role in machine learning and knowledge-based systems, and Ross Quinlan's work on ID3 and C4.5 is widely acknowledged to have made some of the most significant contributions to their development. This book is a complete guide to the C4.5 system as implemented in C for the UNIX environment. It contains a comprehensive guide to the system's use , the source code (about 8,800 lines), and implementation notes. The source code and sample datasets are also available on a 3.5-inch floppy diskette for a Sun workstation.
C4.5 starts with large sets of cases belonging to known classes. The cases, described by any mixture of nominal and numeric properties, are scrutinized for patterns that allow the classes to be reliably discriminated. These patterns are then expressed as models, in the form of decision trees or sets of if-then rules, that can be used to classify new cases, with emphasis on making the models understandable as well as accurate. The system has been applied successfully to tasks involving tens of thousands of cases described by hundreds of properties. The book starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and over hitting. Advantages and disadvantages of the C4.5 approach are discussed and illustrated with several case studies.
This book and software should be of interest to developers of classification-based intelligent systems and to students in machine learning and expert systems courses.
21,674 citations
••
TL;DR: In this paper, an approach to synthesizing decision trees that has been used in a variety of systems, and it describes one such system, ID3, in detail, is described, and a reported shortcoming of the basic algorithm is discussed.
Abstract: The technology for building knowledge-based systems by inductive inference from examples has been demonstrated successfully in several practical applications. This paper summarizes an approach to synthesizing decision trees that has been used in a variety of systems, and it describes one such system, ID3, in detail. Results from recent studies show ways in which the methodology can be modified to deal with information that is noisy and/or incomplete. A reported shortcoming of the basic algorithm is discussed and two means of overcoming it are compared. The paper concludes with illustrations of current research directions.
17,177 citations
••
01 Jun 1993TL;DR: An efficient algorithm is presented that generates all significant association rules between items in the database of customer transactions and incorporates buffer management and novel estimation and pruning techniques.
Abstract: We are given a large database of customer transactions. Each transaction consists of items purchased by a customer in a visit. We present an efficient algorithm that generates all significant association rules between items in the database. The algorithm incorporates buffer management and novel estimation and pruning techniques. We also present results of applying this algorithm to sales data obtained from a large retailing company, which shows the effectiveness of the algorithm.
15,645 citations
•
01 Jan 1965
TL;DR: This chapter discusses the concept of a Random Variable, the meaning of Probability, and the axioms of probability in terms of Markov Chains and Queueing Theory.
Abstract: Part 1 Probability and Random Variables 1 The Meaning of Probability 2 The Axioms of Probability 3 Repeated Trials 4 The Concept of a Random Variable 5 Functions of One Random Variable 6 Two Random Variables 7 Sequences of Random Variables 8 Statistics Part 2 Stochastic Processes 9 General Concepts 10 Random Walk and Other Applications 11 Spectral Representation 12 Spectral Estimation 13 Mean Square Estimation 14 Entropy 15 Markov Chains 16 Markov Processes and Queueing Theory
13,886 citations
•
01 Jan 2002
TL;DR: In this paper, the meaning of probability and random variables are discussed, as well as the axioms of probability, and the concept of a random variable and repeated trials are discussed.
Abstract: Part 1 Probability and Random Variables 1 The Meaning of Probability 2 The Axioms of Probability 3 Repeated Trials 4 The Concept of a Random Variable 5 Functions of One Random Variable 6 Two Random Variables 7 Sequences of Random Variables 8 Statistics Part 2 Stochastic Processes 9 General Concepts 10 Random Walk and Other Applications 11 Spectral Representation 12 Spectral Estimation 13 Mean Square Estimation 14 Entropy 15 Markov Chains 16 Markov Processes and Queueing Theory
12,407 citations