scispace - formally typeset
Search or ask a question
Author

Pang-Ning Tan

Bio: Pang-Ning Tan is an academic researcher from Michigan State University. The author has contributed to research in topics: Cluster analysis & Association rule learning. The author has an hindex of 43, co-authored 191 publications receiving 11436 citations. Previous affiliations of Pang-Ning Tan include University of Minnesota & United States Department of the Army.


Papers
More filters
Journal ArticleDOI
TL;DR: Web usage mining is the application of data mining techniques to discover usage patterns from Web data, in order to understand and better serve the needs of Web-based applications as mentioned in this paper, where preprocessing, pattern discovery, and pattern analysis are described in detail.
Abstract: Web usage mining is the application of data mining techniques to discover usage patterns from Web data, in order to understand and better serve the needs of Web-based applications. Web usage mining consists of three phases, namely preprocessing, pattern discovery, and pattern analysis. This paper describes each of these phases in detail. Given its application potential, Web usage mining has seen a rapid increase in interest, from both the research and practice communities. This paper provides a detailed taxonomy of the work in this area, including research efforts as well as commercial offerings. An up-to-date survey of the existing work is also provided. Finally, a brief overview of the WebSIFT system as an example of a prototypical Web usage mining system is given.

2,227 citations

OtherDOI
28 Jan 2022
TL;DR: This book discusses data mining through the lens of cluster analysis, which examines the relationships between data, clusters, and algorithms, and some of the techniques used to solve these problems.
Abstract: 1 Introduction 1.1 What is Data Mining? 1.2 Motivating Challenges 1.3 The Origins of Data Mining 1.4 Data Mining Tasks 1.5 Scope and Organization of the Book 1.6 Bibliographic Notes 1.7 Exercises 2 Data 2.1 Types of Data 2.2 Data Quality 2.3 Data Preprocessing 2.4 Measures of Similarity and Dissimilarity 2.5 Bibliographic Notes 2.6 Exercises 3 Exploring Data 3.1 The Iris Data Set 3.2 Summary Statistics 3.3 Visualization 3.4 OLAP and Multidimensional Data Analysis 3.5 Bibliographic Notes 3.6 Exercises 4 Classification: Basic Concepts, Decision Trees, and Model Evaluation 4.1 Preliminaries 4.2 General Approach to Solving a Classification Problem 4.3 Decision Tree Induction 4.4 Model Overfitting 4.5 Evaluating the Performance of a Classifier 4.6 Methods for Comparing Classifiers 4.7 Bibliographic Notes 4.8 Exercises 5 Classification: Alternative Techniques 5.1 Rule-Based Classifier 5.2 Nearest-Neighbor Classifiers 5.3 Bayesian Classifiers 5.4 Artificial Neural Network (ANN) 5.5 Support Vector Machine (SVM) 5.6 Ensemble Methods 5.7 Class Imbalance Problem 5.8 Multiclass Problem 5.9 Bibliographic Notes 5.10 Exercises 6 Association Analysis: Basic Concepts and Algorithms 6.1 Problem Definition 6.2 Frequent Itemset Generation 6.3 Rule Generation 6.4 Compact Representation of Frequent Itemsets 6.5 Alternative Methods for Generating Frequent Itemsets 6.6 FP-Growth Algorithm 6.7 Evaluation of Association Patterns 6.8 Effect of Skewed Support Distribution 6.9 Bibliographic Notes 6.10 Exercises 7 Association Analysis: Advanced Concepts 7.1 Handling Categorical Attributes 7.2 Handling Continuous Attributes 7.3 Handling a Concept Hierarchy 7.4 Sequential Patterns 7.5 Subgraph Patterns 7.6 Infrequent Patterns 7.7 Bibliographic Notes 7.8 Exercises 8 Cluster Analysis: Basic Concepts and Algorithms 8.1 Overview 8.2 K-means 8.3 Agglomerative Hierarchical Clustering 8.4 DBSCAN 8.5 Cluster Evaluation 8.6 Bibliographic Notes 8.7 Exercises 9 Cluster Analysis: Additional Issues and Algorithms 9.1 Characteristics of Data, Clusters, and Clustering Algorithms 9.2 Prototype-Based Clustering 9.3 Density-Based Clustering 9.4 Graph-Based Clustering 9.5 Scalable Clustering Algorithms 9.6 Which Clustering Algorithm? 9.7 Bibliographic Notes 9.8 Exercises 10 Anomaly Detection 10.1 Preliminaries 10.2 Statistical Approaches 10.3 Proximity-Based Outlier Detection 10.4 Density-Based Outlier Detection 10.5 Clustering-Based Techniques 10.6 Bibliographic Notes 10.7 Exercises Appendix A Linear Algebra Appendix B Dimensionality Reduction Appendix C Probability and Statistics Appendix D Regression Appendix E Optimization Author Index Subject Index

1,118 citations

Proceedings ArticleDOI
23 Jul 2002
TL;DR: An overview of various measures proposed in the statistics, machine learning and data mining literature is presented and it is shown that each measure has different properties which make them useful for some application domains, but not for others.
Abstract: Many techniques for association rule mining and feature selection require a suitable metric to capture the dependencies among variables in a data set. For example, metrics such as support, confidence, lift, correlation, and collective strength are often used to determine the interestingness of association patterns. However, many such measures provide conflicting information about the interestingness of a pattern, and the best metric to use for a given application domain is rarely known. In this paper, we present an overview of various measures proposed in the statistics, machine learning and data mining literature. We describe several key properties one should examine in order to select the right measure for a given application domain. A comparative study of these properties is made using twenty one of the existing measures. We show that each measure has different properties which make them useful for some application domains, but not for others. We also present two scenarios in which most of the existing measures agree with each other, namely, support-based pruning and table standardization. Finally, we present an algorithm to select a small set of tables such that an expert can select a desirable measure by looking at just this small set of tables.

985 citations

Journal ArticleDOI
01 Jun 2004
TL;DR: This paper describes several key properties one should examine in order to select the right measure for a given application and presents an algorithm for selecting a small set of patterns so that domain experts can find a measure that best fits their requirements by ranking this smallSet of patterns.
Abstract: Objective measures such as support, confidence, interest factor, correlation, and entropy are often used to evaluate the interestingness of association patterns. However, in many situations, these measures may provide conflicting information about the interestingness of a pattern. Data mining practitioners also tend to apply an objective measure without realizing that there may be better alternatives available for their application. In this paper, we describe several key properties one should examine in order to select the right measure for a given application. A comparative study of these properties is made using twenty-one measures that were originally developed in diverse fields such as statistics, social science, machine learning, and data mining. We show that depending on its properties, each measure is useful for some application, but not for others. We also demonstrate two scenarios in which many existing measures become consistent with each other, namely, when support-based pruning and a technique known as table standardization are applied. Finally, we present an algorithm for selecting a small set of patterns such that domain experts can find a measure that best fits their requirements by ranking this small set of patterns.

575 citations


Cited by
More filters
Book
08 Sep 2000
TL;DR: This book presents dozens of algorithms and implementation examples, all in pseudo-code and suitable for use in real-world, large-scale data mining projects, and provides a comprehensive, practical look at the concepts and techniques you need to get the most out of real business data.
Abstract: The increasing volume of data in modern business and science calls for more complex and sophisticated tools. Although advances in data mining technology have made extensive data collection much easier, it's still always evolving and there is a constant need for new techniques and tools that can help us transform this data into useful information and knowledge. Since the previous edition's publication, great advances have been made in the field of data mining. Not only does the third of edition of Data Mining: Concepts and Techniques continue the tradition of equipping you with an understanding and application of the theory and practice of discovering patterns hidden in large data sets, it also focuses on new, important topics in the field: data warehouses and data cube technology, mining stream, mining social networks, and mining spatial, multimedia and other complex data. Each chapter is a stand-alone guide to a critical topic, presenting proven algorithms and sound implementations ready to be used directly or with strategic modification against live data. This is the resource you need if you want to apply today's most powerful data mining techniques to meet real business challenges. * Presents dozens of algorithms and implementation examples, all in pseudo-code and suitable for use in real-world, large-scale data mining projects. * Addresses advanced topics such as mining object-relational databases, spatial databases, multimedia databases, time-series databases, text databases, the World Wide Web, and applications in several fields. *Provides a comprehensive, practical look at the concepts and techniques you need to get the most out of real business data

23,600 citations

Journal ArticleDOI
TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.
Abstract: Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories. First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules. Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, hand-writing recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs. Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules. Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up-to-date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically. Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).

13,246 citations

Christopher M. Bishop1
01 Jan 2006
TL;DR: Probability distributions of linear models for regression and classification are given in this article, along with a discussion of combining models and combining models in the context of machine learning and classification.
Abstract: Probability Distributions.- Linear Models for Regression.- Linear Models for Classification.- Neural Networks.- Kernel Methods.- Sparse Kernel Machines.- Graphical Models.- Mixture Models and EM.- Approximate Inference.- Sampling Methods.- Continuous Latent Variables.- Sequential Data.- Combining Models.

10,141 citations

Journal ArticleDOI
TL;DR: This survey tries to provide a structured and comprehensive overview of the research on anomaly detection by grouping existing techniques into different categories based on the underlying approach adopted by each technique.
Abstract: Anomaly detection is an important problem that has been researched within diverse research areas and application domains. Many anomaly detection techniques have been specifically developed for certain application domains, while others are more generic. This survey tries to provide a structured and comprehensive overview of the research on anomaly detection. We have grouped existing techniques into different categories based on the underlying approach adopted by each technique. For each category we have identified key assumptions, which are used by the techniques to differentiate between normal and anomalous behavior. When applying a given technique to a particular domain, these assumptions can be used as guidelines to assess the effectiveness of the technique in that domain. For each category, we provide a basic anomaly detection technique, and then show how the different existing techniques in that category are variants of the basic technique. This template provides an easier and more succinct understanding of the techniques belonging to each category. Further, for each category, we identify the advantages and disadvantages of the techniques in that category. We also provide a discussion on the computational complexity of the techniques since it is an important issue in real application domains. We hope that this survey will provide a better understanding of the different directions in which research has been done on this topic, and how techniques developed in one area can be applied in domains for which they were not intended to begin with.

9,627 citations