
Showing papers by "Pang-Ning Tan published in 2006"


Journal ArticleDOI
TL;DR: This paper proves that the items in a hyperclique pattern have a guaranteed level of global pairwise similarity to one another as measured by the cosine similarity (uncentered Pearson's correlation coefficient), and shows that the h-confidence measure satisfies a cross-support property which can help efficiently eliminate spurious patterns involving items with substantially different support levels.
Abstract: Existing algorithms for mining association patterns often rely on the support-based pruning strategy to prune a combinatorial search space. However, this strategy is not effective for discovering potentially interesting patterns at low levels of support. Also, it tends to generate too many spurious patterns involving items which are from different support levels and are poorly correlated. In this paper, we present a framework for mining highly-correlated association patterns called hyperclique patterns. In this framework, an objective measure called h-confidence is applied to discover hyperclique patterns. We prove that the items in a hyperclique pattern have a guaranteed level of global pairwise similarity to one another as measured by the cosine similarity (uncentered Pearson's correlation coefficient). Also, we show that the h-confidence measure satisfies a cross-support property which can help efficiently eliminate spurious patterns involving items with substantially different support levels. Indeed, this cross-support property is not limited to h-confidence and can be generalized to some other association measures. In addition, an algorithm called hyperclique miner is proposed to exploit both cross-support and anti-monotone properties of the h-confidence measure for the efficient discovery of hyperclique patterns. Finally, our experimental results show that hyperclique miner can efficiently identify hyperclique patterns, even at extremely low levels of support.

164 citations
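The h-confidence measure described above is simple to state: for a pattern P, h-conf(P) = supp(P) / max over items i in P of supp({i}), and the paper's guarantee means every pair of items in a hyperclique has cosine similarity at least the h-confidence threshold. A minimal sketch on a toy transaction database (the items and threshold are hypothetical, not from the paper's experiments):

```python
from itertools import combinations

# Hypothetical transaction database.
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"a", "b", "c", "d"},
    {"b", "c"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def h_confidence(itemset):
    """h-conf(P) = supp(P) / max_{i in P} supp({i})."""
    return support(itemset) / max(support({i}) for i in itemset)

# A pattern is a hyperclique at threshold hc if its h-confidence >= hc.
# Brute-force enumeration here; the paper's hyperclique miner instead
# prunes via the anti-monotone and cross-support properties.
hc = 0.5
items = {"a", "b", "c", "d"}
hypercliques = [
    set(p)
    for r in range(2, len(items) + 1)
    for p in combinations(sorted(items), r)
    if h_confidence(set(p)) >= hc
]
```

Note how the rare item "d" (support 1/5) never forms a hyperclique with the frequent items: the denominator of h-confidence makes such cross-support patterns score low, which is exactly the pruning property the paper exploits.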


Journal Article
TL;DR: This paper examines two alternative approaches to multistep-ahead prediction: independent value prediction, which builds a separate model for each prediction step using the values observed in the past, and parameter prediction, which fits a parametric function to the time series and builds models to predict its parameters.
Abstract: Multistep-ahead prediction is the task of predicting a sequence of values in a time series. A typical approach, known as multi-stage prediction, is to apply a predictive model step-by-step and use the predicted value of the current time step to determine its value in the next time step. This paper examines two alternative approaches known as independent value prediction and parameter prediction. The first approach builds a separate model for each prediction step using the values observed in the past. The second approach fits a parametric function to the time series and builds models to predict the parameters of the function. We perform a comparative study on the three approaches using multiple linear regression, recurrent neural networks, and a hybrid of hidden Markov model with multiple linear regression. The advantages and disadvantages of each approach are analyzed in terms of their error accumulation, smoothness of prediction, and learning difficulty.

149 citations
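The first two strategies compared above are easy to contrast in code. Below is a minimal sketch of multi-stage prediction versus independent value prediction using ordinary least squares on a synthetic sine series; the series, window, and horizon are illustrative assumptions, and the paper also studies recurrent neural networks and an HMM/regression hybrid:

```python
import numpy as np

rng = np.random.default_rng(0)
series = np.sin(np.arange(120) * 0.2) + rng.normal(0, 0.05, 120)
window, horizon = 5, 3
train = series[:100]

def lag_matrix(y, w):
    """Rows of w consecutive values, used as regression inputs."""
    return np.array([y[i:i + w] for i in range(len(y) - w)])

def fit(X, t):
    """Least-squares linear model with an intercept term."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, t, rcond=None)
    return coef

def predict(coef, x):
    return float(np.append(x, 1.0) @ coef)

X = lag_matrix(train, window)

# Multi-stage: a single one-step-ahead model, fed its own predictions,
# so errors can accumulate across the horizon.
one_step = fit(X, train[window:])
hist = list(train[-window:])
multistage = []
for _ in range(horizon):
    y = predict(one_step, np.array(hist[-window:]))
    multistage.append(y)
    hist.append(y)

# Independent value prediction: a separate model per horizon step h,
# each trained only on actually observed values.
independent = []
for h in range(1, horizon + 1):
    rows = len(train) - window - h + 1
    model = fit(X[:rows], train[window + h - 1:])
    independent.append(predict(model, train[-window:]))
```

For h = 1 the two approaches coincide (same model, same input); they diverge at later steps, where multi-stage predictions depend on earlier predicted values.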


Proceedings ArticleDOI
18 Dec 2006
TL;DR: This paper presents two methods for transforming outlier scores into probabilities: one learns a logistic sigmoid function from the distribution of outlier scores, and the other models the score distribution as a mixture of exponential and Gaussian probability functions and calculates the posterior probabilities via Bayes' rule.
Abstract: Current outlier detection schemes typically output a numeric score representing the degree to which a given observation is an outlier. We argue that converting the scores into well-calibrated probability estimates is more favorable for several reasons. First, the probability estimates allow us to select the appropriate threshold for declaring outliers using a Bayesian risk model. Second, the probability estimates obtained from individual models can be aggregated to build an ensemble outlier detection framework. In this paper, we present two methods for transforming outlier scores into probabilities. The first approach assumes that the posterior probabilities follow a logistic sigmoid function and learns the parameters of the function from the distribution of outlier scores. The second approach models the score distributions as a mixture of exponential and Gaussian probability functions and calculates the posterior probabilities via Bayes' rule. We evaluated the efficacy of both methods in the context of threshold selection and ensemble outlier detection. We also show that the calibration accuracy improves with the aid of some labeled examples.

143 citations
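The first calibration method, fitting a logistic sigmoid to outlier scores, can be sketched as a small Platt-style logistic regression on (score, label) pairs. The scores and labels below are made up for illustration; the paper learns the parameters from the distribution of outlier scores, and its second method uses an exponential/Gaussian mixture instead:

```python
import math

# Hypothetical outlier scores with labels (1 = outlier) for calibration.
scores = [0.2, 0.5, 0.9, 1.1, 2.5, 3.0, 3.8, 4.2]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_sigmoid(scores, labels, lr=0.1, epochs=2000):
    """Fit P(outlier | s) = sigmoid(a*s + b) by gradient descent on log-loss."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        ga = gb = 0.0
        for s, y in zip(scores, labels):
            e = sigmoid(a * s + b) - y   # prediction residual
            ga += e * s
            gb += e
        a -= lr * ga / n
        b -= lr * gb / n
    return a, b

a, b = fit_sigmoid(scores, labels)

def prob(s):
    """Calibrated probability that an observation with score s is an outlier."""
    return sigmoid(a * s + b)
```

Once scores are mapped to probabilities, a threshold can be chosen by comparing misclassification costs under a Bayesian risk model rather than by eyeballing raw scores.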


Proceedings ArticleDOI
13 Nov 2006
TL;DR: This paper presents a stochastic graph-based algorithm, called OutRank, for detecting outlying objects; the algorithm is shown to be more powerful than existing outlier detection schemes and to effectively address their inherent problems.
Abstract: The discovery of objects with exceptional behavior is an important challenge from a knowledge discovery standpoint and has attracted much attention recently. In this paper, we present a stochastic graph-based algorithm, called OutRank, for detecting outlying objects. In our method, a matrix is constructed using the similarity between objects and used as the adjacency matrix of the graph representation. The heart of this approach is the Markov model that is built upon this graph, which assigns an outlier score to each object. Using this framework, we show that our algorithm is more powerful than the existing outlier detection schemes and can effectively address the inherent problems of such schemes. Empirical studies conducted on both real and synthetic data sets show that significant improvements in detection rate and a lower false alarm rate are achieved using our proposed framework.

87 citations
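The random-walk idea behind OutRank can be sketched as follows: build a similarity graph over the objects, row-normalize its adjacency matrix into a Markov transition matrix, and rank objects by their stationary probability, so that poorly connected objects score as outliers. The Gaussian-kernel similarity and toy points below are illustrative assumptions, not the paper's exact construction:

```python
import numpy as np

# Hypothetical 2-D data: a tight cluster plus one far-away point.
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])

# Similarity matrix via a Gaussian kernel on pairwise distances
# (one common choice of similarity; zero self-similarity).
d = np.linalg.norm(points[:, None] - points[None, :], axis=2)
S = np.exp(-d ** 2)
np.fill_diagonal(S, 0.0)

# Row-normalise into a Markov transition matrix and find its
# stationary distribution by power iteration.
P = S / S.sum(axis=1, keepdims=True)
pi = np.full(len(points), 1.0 / len(points))
for _ in range(200):
    pi = pi @ P

# Objects with the smallest stationary probability rank as most outlying.
outlier = int(np.argmin(pi))
```

Because the isolated point receives almost no incoming probability mass from the cluster, its stationary probability is far below the others, which is the graph-based analogue of a density- or distance-based outlier score.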


Proceedings ArticleDOI
23 Apr 2006
TL;DR: This paper is concerned with employing a limited amount of label information to detect outliers more accurately, using an objective function that penalizes poor clustering results and deviation from known labels while restricting the number of outliers.
Abstract: Outlier detection has been extensively researched in the context of unsupervised learning, but the results are not always satisfactory; they can be significantly improved by using the supervision provided by a few labeled points. In this paper, we are concerned with employing a limited amount of label information to detect outliers more accurately. The key to our approach is an objective function that penalizes poor clustering results and deviation from the known labels, and also restricts the number of outliers. The outliers are found by solving a discrete optimization problem over this objective function. In this way, the method can detect meaningful outliers that cannot be identified by existing unsupervised methods.

49 citations


Journal Article
01 Jan 2006-Sensors

44 citations


Book ChapterDOI
09 Apr 2006
TL;DR: In this paper, independent value prediction and parameter prediction are compared in terms of their error accumulation, smoothness of prediction, and learning difficulty, and the advantages and disadvantages of each approach are analyzed.
Abstract: Multistep-ahead prediction is the task of predicting a sequence of values in a time series. A typical approach, known as multi-stage prediction, is to apply a predictive model step-by-step and use the predicted value of the current time step to determine its value in the next time step. This paper examines two alternative approaches known as independent value prediction and parameter prediction. The first approach builds a separate model for each prediction step using the values observed in the past. The second approach fits a parametric function to the time series and builds models to predict the parameters of the function. We perform a comparative study on the three approaches using multiple linear regression, recurrent neural networks, and a hybrid of hidden Markov model with multiple linear regression. The advantages and disadvantages of each approach are analyzed in terms of their error accumulation, smoothness of prediction, and learning difficulty.

41 citations


Proceedings ArticleDOI
18 Dec 2006
TL;DR: This paper presents PGMiner, a novel graph-based algorithm for mining frequent closed itemsets that constructs a prefix graph structure and decomposes the database into variable-length bit vectors, which are assigned to nodes of the graph.
Abstract: This paper presents PGMiner, a novel graph-based algorithm for mining frequent closed itemsets. Our approach consists of constructing a prefix graph structure and decomposing the database into variable-length bit vectors, which are assigned to nodes of the graph. The main advantage of this representation is that the bit vectors at each node are relatively shorter than those produced by existing vertical mining methods. This facilitates fast frequency counting of itemsets via intersection operations. We also devise several inter-node and intra-node pruning strategies to substantially reduce the combinatorial search space. Unlike other existing approaches, we do not need to store in memory the entire set of closed itemsets that have been mined so far in order to check whether a candidate itemset is closed. This dramatically reduces the memory usage of our algorithm, especially for low support thresholds. Our experiments using synthetic and real-world data sets show that PGMiner outperforms existing mining algorithms by as much as an order of magnitude and is scalable to very large databases.

31 citations
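The bit-vector representation at the heart of PGMiner enables support counting by bitwise AND. The sketch below shows only that counting idea on a toy database; the prefix graph, database decomposition, and pruning strategies of the actual algorithm are not reproduced:

```python
# Hypothetical vertical representation: one bit vector per item,
# where bit i is set iff transaction i contains the item.
transactions = [
    {"a", "b"},
    {"a", "c"},
    {"a", "b", "c"},
    {"b", "c"},
]

def bitvector(item):
    """Pack an item's transaction occurrences into an int used as a bit vector."""
    bits = 0
    for i, t in enumerate(transactions):
        if item in t:
            bits |= 1 << i
    return bits

vectors = {item: bitvector(item) for item in "abc"}

def support(itemset):
    """Support counting by intersecting (AND-ing) the item bit vectors."""
    bits = ~0
    for item in itemset:
        bits &= vectors[item]
    # Mask to the transaction count, then popcount the surviving bits.
    return bin(bits & ((1 << len(transactions)) - 1)).count("1")
```

Each AND-plus-popcount replaces a scan over the transaction list, which is why shorter bit vectors per graph node translate directly into faster frequency counting.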


Journal Article
01 Jan 2006-Scopus
TL;DR: In this article, the problem of incorporating partial background knowledge into clustering was formulated as a constrained optimization problem, and two learning algorithms based on hard and fuzzy clustering methods were proposed to solve the problem.
Abstract: Incorporating background knowledge into unsupervised clustering algorithms has been the subject of extensive research in recent years. Nevertheless, existing algorithms implicitly assume that the background information, typically specified in the form of labeled examples or pairwise constraints, has the same feature space as the unlabeled data to be clustered. In this paper, we are concerned with a new problem of incorporating partial background knowledge into clustering, where the labeled examples have moderate overlapping features with the unlabeled data. We formulate this as a constrained optimization problem, and propose two learning algorithms to solve the problem, based on hard and fuzzy clustering methods. An empirical study performed on a variety of real data sets shows that our proposed algorithms improve the quality of clustering results with limited labeled examples.

26 citations
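A label-penalized clustering objective of the general kind described above can be sketched with a k-means-style alternation, where the assignment step adds a penalty for contradicting a known label. This is a generic illustration, not the paper's formulation, which additionally handles labeled examples whose feature space only partially overlaps that of the unlabeled data:

```python
import numpy as np

# Hypothetical data: two Gaussian blobs; a few points carry known labels.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
known = {0: 0, 1: 0, 20: 1, 21: 1}   # point index -> known cluster label
lam, k = 10.0, 2                     # label-violation penalty, cluster count

centers = X[[0, 20]].copy()
for _ in range(20):
    # Assignment step: squared distance plus a penalty for assigning
    # a labeled point to any cluster other than its known one.
    d = np.linalg.norm(X[:, None] - centers[None], axis=2) ** 2
    for i, c in known.items():
        d[i] += lam * (np.arange(k) != c)
    assign = d.argmin(axis=1)
    # Update step: recompute each center as the mean of its cluster.
    for j in range(k):
        if (assign == j).any():
            centers[j] = X[assign == j].mean(axis=0)
```

The penalty term plays the role of the "deviation from known labels" part of a constrained clustering objective: with lam = 0 this reduces to plain k-means, while larger lam forces agreement with the supervision.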


Proceedings Article
01 Jan 2006
TL;DR: It is demonstrated that, with the addition of labeled examples, the anomaly detection algorithm can be guided to develop better models of the normal and abnormal behavior of the data, thus improving the detection rate and reducing the false alarm rate of the algorithm.
Abstract: This paper presents a principled approach for incorporating labeled examples into an anomaly detection task. We demonstrate that, with the addition of labeled examples, the anomaly detection algorithm can be guided to develop better models of the normal and abnormal behavior of the data, thus improving the detection rate and reducing the false alarm rate of the algorithm. A framework based on the finite mixture model is introduced to model the data as well as the constraints imposed by the labeled examples. Empirical studies conducted on real data sets show that significant improvements in detection rate and false alarm rate are achieved using our proposed framework.

23 citations


Proceedings Article
01 Jan 2006
TL;DR: This paper presents a novel clustering algorithm called MIN-CUT that alleviates the effect of bridge-nodes and leads to purer clusters with very little overlap, since clusters that overlap too much become less informative.
Abstract: In this paper, we study the ill-effects of bridge-nodes, which cause many dissimilar objects to be placed together in the same cluster by existing clustering algorithms. We offer two new metrics for measuring how well a clustering algorithm handles the presence of bridge-nodes. We also illustrate how algorithms that produce overlapping clusters help to alleviate the effect of bridge-nodes and form more meaningful clusters. However, if there is too much overlap, the clusters become less informative. To address this problem, we present a novel clustering algorithm called MIN-CUT. Our experimental results with real data sets show that the MIN-CUT algorithm leads to purer clusters that have very little overlap.


Proceedings ArticleDOI
Brian D. Connelly1, C.W. Bowron1, Li Xiao1, Pang-Ning Tan1, Chen Wang1 
14 Aug 2006
TL;DR: This work borrows the technique of association analysis from the data mining community and extends it to intelligently forward queries through the network, enabling unstructured peer-to-peer networks to scale to much larger sizes.
Abstract: Unstructured peer-to-peer networks have become a very popular method for content distribution in the past few years. By not enforcing strict rules on the network's topology or content location, such networks can be created quickly and easily. Unfortunately, because of the unstructured nature of these networks, in order to find content, query messages are flooded to nodes in the network, which results in a large amount of traffic. This work borrows the technique of association analysis from the data mining community and extends it to intelligently forward queries through the network. Because only a small subset of a node's neighbors are forwarded queries, the number of times those queries are propagated is also reduced, which results in considerably less network traffic. These savings enable the networks to scale to much larger sizes, which allows for more content to be shared and more redundancy to be added to the system, as well as allowing more users to take advantage of such networks.
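The forwarding idea can be sketched as keeping per-keyword statistics about which neighbors previously answered queries, then forwarding new queries only to the most strongly associated neighbors and falling back to flooding for unseen keywords. The hit log, neighbor names, and selection rule below are illustrative assumptions, not the paper's exact association-rule mining procedure:

```python
from collections import defaultdict

# Hypothetical hit log: (query keyword, neighbor that returned a result).
hits = [
    ("rock", "n1"), ("rock", "n1"), ("rock", "n2"),
    ("jazz", "n2"), ("jazz", "n3"), ("rock", "n1"),
]

# Count co-occurrences of each keyword with each answering neighbor.
counts = defaultdict(lambda: defaultdict(int))
for kw, n in hits:
    counts[kw][n] += 1

def forward_to(keyword, k=1, neighbors=("n1", "n2", "n3")):
    """Forward to the k neighbors most associated with the keyword;
    fall back to flooding all neighbors for unseen keywords."""
    seen = counts.get(keyword)
    if not seen:
        return list(neighbors)
    return sorted(seen, key=seen.get, reverse=True)[:k]
```

Forwarding to a small associated subset instead of every neighbor is what cuts the per-query message count, at the cost of occasionally missing content held by an unqueried neighbor.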


Proceedings Article
01 Jan 2006
TL;DR: This paper forms this as a constrained optimization problem, and proposes two learning algorithms to solve the problem, based on hard and fuzzy clustering methods, that improve the quality of clustering results with limited labeled examples.
Abstract: Incorporating background knowledge into unsupervised clustering algorithms has been the subject of extensive research in recent years. Nevertheless, existing algorithms implicitly assume that the background information, typically specified in the form of labeled examples or pairwise constraints, has the same feature space as the unlabeled data to be clustered. In this paper, we are concerned with a new problem of incorporating partial background knowledge into clustering, where the labeled examples have moderate overlapping features with the unlabeled data. We formulate this as a constrained optimization problem, and propose two learning algorithms to solve the problem, based on hard and fuzzy clustering methods. An empirical study performed on a variety of real data sets shows that our proposed algorithms improve the quality of clustering results with limited labeled examples.