
Showing papers by "Pang-Ning Tan published in 2006"


Journal ArticleDOI
TL;DR: This paper proves that the items in a hyperclique pattern have a guaranteed level of global pairwise similarity to one another as measured by the cosine similarity (uncentered Pearson's correlation coefficient), and shows that the h-confidence measure satisfies a cross-support property which can help efficiently eliminate spurious patterns involving items with substantially different support levels.
Abstract: Existing algorithms for mining association patterns often rely on the support-based pruning strategy to prune a combinatorial search space. However, this strategy is not effective for discovering potentially interesting patterns at low levels of support. Also, it tends to generate too many spurious patterns involving items which are from different support levels and are poorly correlated. In this paper, we present a framework for mining highly-correlated association patterns called hyperclique patterns. In this framework, an objective measure called h-confidence is applied to discover hyperclique patterns. We prove that the items in a hyperclique pattern have a guaranteed level of global pairwise similarity to one another as measured by the cosine similarity (uncentered Pearson's correlation coefficient). Also, we show that the h-confidence measure satisfies a cross-support property which can help efficiently eliminate spurious patterns involving items with substantially different support levels. Indeed, this cross-support property is not limited to h-confidence and can be generalized to some other association measures. In addition, an algorithm called hyperclique miner is proposed to exploit both cross-support and anti-monotone properties of the h-confidence measure for the efficient discovery of hyperclique patterns. Finally, our experimental results show that hyperclique miner can efficiently identify hyperclique patterns, even at extremely low levels of support.

164 citations
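The h-confidence measure described above is simple to state: for a pattern P, h-conf(P) = supp(P) / max over items i in P of supp({i}), and the paper's guarantee means every pair of items in a hyperclique has cosine similarity at least the h-confidence threshold. A minimal sketch on a toy transaction database (the items and threshold are hypothetical, not from the paper's experiments):

```python
from itertools import combinations

# Hypothetical transaction database.
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"a", "b", "c", "d"},
    {"b", "c"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def h_confidence(itemset):
    """h-conf(P) = supp(P) / max_{i in P} supp({i})."""
    return support(itemset) / max(support({i}) for i in itemset)

# A pattern is a hyperclique at threshold hc if its h-confidence >= hc.
# Brute-force enumeration here; the paper's hyperclique miner instead
# prunes via the anti-monotone and cross-support properties.
hc = 0.5
items = {"a", "b", "c", "d"}
hypercliques = [
    set(p)
    for r in range(2, len(items) + 1)
    for p in combinations(sorted(items), r)
    if h_confidence(set(p)) >= hc
]
```

Note how the rare item "d" (support 1/5) never forms a hyperclique with the frequent items: the denominator of h-confidence makes such cross-support patterns score low, which is exactly the pruning property the paper exploits.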


Journal Article
TL;DR: This paper examines two alternative approaches to multistep-ahead prediction: independent value prediction, which builds a separate model for each prediction step using the values observed in the past, and parameter prediction, which fits a parametric function to the time series and builds models to predict its parameters.
Abstract: Multistep-ahead prediction is the task of predicting a sequence of values in a time series. A typical approach, known as multi-stage prediction, is to apply a predictive model step-by-step and use the predicted value of the current time step to determine its value in the next time step. This paper examines two alternative approaches known as independent value prediction and parameter prediction. The first approach builds a separate model for each prediction step using the values observed in the past. The second approach fits a parametric function to the time series and builds models to predict the parameters of the function. We perform a comparative study on the three approaches using multiple linear regression, recurrent neural networks, and a hybrid of hidden Markov model with multiple linear regression. The advantages and disadvantages of each approach are analyzed in terms of their error accumulation, smoothness of prediction, and learning difficulty.

149 citations
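The first two strategies compared above are easy to contrast in code. Below is a minimal sketch of multi-stage prediction versus independent value prediction using ordinary least squares on a synthetic sine series; the series, window, and horizon are illustrative assumptions, and the paper also studies recurrent neural networks and an HMM/regression hybrid:

```python
import numpy as np

rng = np.random.default_rng(0)
series = np.sin(np.arange(120) * 0.2) + rng.normal(0, 0.05, 120)
window, horizon = 5, 3
train = series[:100]

def lag_matrix(y, w):
    """Rows of w consecutive values, used as regression inputs."""
    return np.array([y[i:i + w] for i in range(len(y) - w)])

def fit(X, t):
    """Least-squares linear model with an intercept term."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, t, rcond=None)
    return coef

def predict(coef, x):
    return float(np.append(x, 1.0) @ coef)

X = lag_matrix(train, window)

# Multi-stage: a single one-step-ahead model, fed its own predictions,
# so errors can accumulate across the horizon.
one_step = fit(X, train[window:])
hist = list(train[-window:])
multistage = []
for _ in range(horizon):
    y = predict(one_step, np.array(hist[-window:]))
    multistage.append(y)
    hist.append(y)

# Independent value prediction: a separate model per horizon step h,
# each trained only on actually observed values.
independent = []
for h in range(1, horizon + 1):
    rows = len(train) - window - h + 1
    model = fit(X[:rows], train[window + h - 1:])
    independent.append(predict(model, train[-window:]))
```

For h = 1 the two approaches coincide (same model, same input); they diverge at later steps, where multi-stage predictions depend on earlier predicted values.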


Proceedings ArticleDOI
18 Dec 2006
TL;DR: This paper presents two methods for transforming outlier scores into probabilities: one learns a logistic sigmoid function from the distribution of outlier scores, and the other models the score distribution as a mixture of exponential and Gaussian probability functions and calculates the posterior probabilities via Bayes' rule.
Abstract: Current outlier detection schemes typically output a numeric score representing the degree to which a given observation is an outlier. We argue that converting the scores into well-calibrated probability estimates is more favorable for several reasons. First, the probability estimates allow us to select the appropriate threshold for declaring outliers using a Bayesian risk model. Second, the probability estimates obtained from individual models can be aggregated to build an ensemble outlier detection framework. In this paper, we present two methods for transforming outlier scores into probabilities. The first approach assumes that the posterior probabilities follow a logistic sigmoid function and learns the parameters of the function from the distribution of outlier scores. The second approach models the score distributions as a mixture of exponential and Gaussian probability functions and calculates the posterior probabilities via Bayes' rule. We evaluated the efficacy of both methods in the context of threshold selection and ensemble outlier detection. We also show that the calibration accuracy improves with the aid of some labeled examples.

143 citations
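The first calibration method, fitting a logistic sigmoid to outlier scores, can be sketched as a small Platt-style logistic regression on (score, label) pairs. The scores and labels below are made up for illustration; the paper learns the parameters from the distribution of outlier scores, and its second method uses an exponential/Gaussian mixture instead:

```python
import math

# Hypothetical outlier scores with labels (1 = outlier) for calibration.
scores = [0.2, 0.5, 0.9, 1.1, 2.5, 3.0, 3.8, 4.2]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_sigmoid(scores, labels, lr=0.1, epochs=2000):
    """Fit P(outlier | s) = sigmoid(a*s + b) by gradient descent on log-loss."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        ga = gb = 0.0
        for s, y in zip(scores, labels):
            e = sigmoid(a * s + b) - y   # prediction residual
            ga += e * s
            gb += e
        a -= lr * ga / n
        b -= lr * gb / n
    return a, b

a, b = fit_sigmoid(scores, labels)

def prob(s):
    """Calibrated probability that an observation with score s is an outlier."""
    return sigmoid(a * s + b)
```

Once scores are mapped to probabilities, a threshold can be chosen by comparing misclassification costs under a Bayesian risk model rather than by eyeballing raw scores.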


Proceedings ArticleDOI
13 Nov 2006
TL;DR: This paper presents a stochastic graph-based algorithm, called OutRank, for detecting outlying objects; the algorithm is shown to be more powerful than existing outlier detection schemes and to effectively address their inherent problems.
Abstract: The discovery of objects with exceptional behavior is an important challenge from a knowledge discovery standpoint and has attracted much attention recently. In this paper, we present a stochastic graph-based algorithm, called OutRank, for detecting outlying objects. In our method, a matrix is constructed using the similarity between objects and used as the adjacency matrix of the graph representation. The heart of this approach is the Markov model that is built upon this graph, which assigns an outlier score to each object. Using this framework, we show that our algorithm is more powerful than the existing outlier detection schemes and can effectively address the inherent problems of such schemes. Empirical studies conducted on both real and synthetic data sets show that significant improvements in detection rate and a lower false alarm rate are achieved using our proposed framework.

87 citations
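The random-walk idea behind OutRank can be sketched as follows: build a similarity graph over the objects, row-normalize its adjacency matrix into a Markov transition matrix, and rank objects by their stationary probability, so that poorly connected objects score as outliers. The Gaussian-kernel similarity and toy points below are illustrative assumptions, not the paper's exact construction:

```python
import numpy as np

# Hypothetical 2-D data: a tight cluster plus one far-away point.
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])

# Similarity matrix via a Gaussian kernel on pairwise distances
# (one common choice of similarity; zero self-similarity).
d = np.linalg.norm(points[:, None] - points[None, :], axis=2)
S = np.exp(-d ** 2)
np.fill_diagonal(S, 0.0)

# Row-normalise into a Markov transition matrix and find its
# stationary distribution by power iteration.
P = S / S.sum(axis=1, keepdims=True)
pi = np.full(len(points), 1.0 / len(points))
for _ in range(200):
    pi = pi @ P

# Objects with the smallest stationary probability rank as most outlying.
outlier = int(np.argmin(pi))
```

Because the isolated point receives almost no incoming probability mass from the cluster, its stationary probability is far below the others, which is the graph-based analogue of a density- or distance-based outlier score.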


Proceedings ArticleDOI
23 Apr 2006
TL;DR: This paper is concerned with employing a limited amount of label information to detect outliers more accurately, using an objective function that penalizes poor clustering results and deviation from known labels while restricting the number of outliers.
Abstract: Outlier detection has been extensively researched in the context of unsupervised learning, but the results are not always satisfactory; they can be significantly improved by using the supervision provided by a few labeled points. In this paper, we are concerned with employing a limited amount of label information to detect outliers more accurately. The key to our approach is an objective function that penalizes poor clustering results and deviation from the known labels, and also restricts the number of outliers. The outliers are found by solving a discrete optimization problem over this objective function. In this way, the method can detect meaningful outliers that cannot be identified by existing unsupervised methods.

49 citations


Journal Article
01 Jan 2006-Sensors

44 citations


Book ChapterDOI
09 Apr 2006
TL;DR: In this paper, independent value prediction and parameter prediction are compared in terms of their error accumulation, smoothness of prediction, and learning difficulty, and the advantages and disadvantages of each approach are analyzed.
Abstract: Multistep-ahead prediction is the task of predicting a sequence of values in a time series. A typical approach, known as multi-stage prediction, is to apply a predictive model step-by-step and use the predicted value of the current time step to determine its value in the next time step. This paper examines two alternative approaches known as independent value prediction and parameter prediction. The first approach builds a separate model for each prediction step using the values observed in the past. The second approach fits a parametric function to the time series and builds models to predict the parameters of the function. We perform a comparative study on the three approaches using multiple linear regression, recurrent neural networks, and a hybrid of hidden Markov model with multiple linear regression. The advantages and disadvantages of each approach are analyzed in terms of their error accumulation, smoothness of prediction, and learning difficulty.

41 citations


Proceedings ArticleDOI
18 Dec 2006
TL;DR: This paper presents PGMiner, a novel graph-based algorithm for mining frequent closed itemsets that constructs a prefix graph structure and decomposes the database into variable-length bit vectors, which are assigned to nodes of the graph.
Abstract: This paper presents PGMiner, a novel graph-based algorithm for mining frequent closed itemsets. Our approach consists of constructing a prefix graph structure and decomposing the database into variable-length bit vectors, which are assigned to nodes of the graph. The main advantage of this representation is that the bit vectors at each node are relatively shorter than those produced by existing vertical mining methods. This facilitates fast frequency counting of itemsets via intersection operations. We also devise several inter-node and intra-node pruning strategies to substantially reduce the combinatorial search space. Unlike other existing approaches, we do not need to store in memory the entire set of closed itemsets that have been mined so far in order to check whether a candidate itemset is closed. This dramatically reduces the memory usage of our algorithm, especially for low support thresholds. Our experiments using synthetic and real-world data sets show that PGMiner outperforms existing mining algorithms by as much as an order of magnitude and is scalable to very large databases.

31 citations
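The bit-vector representation at the heart of PGMiner enables support counting by bitwise AND. The sketch below shows only that counting idea on a toy database; the prefix graph, database decomposition, and pruning strategies of the actual algorithm are not reproduced:

```python
# Hypothetical vertical representation: one bit vector per item,
# where bit i is set iff transaction i contains the item.
transactions = [
    {"a", "b"},
    {"a", "c"},
    {"a", "b", "c"},
    {"b", "c"},
]

def bitvector(item):
    """Pack an item's transaction occurrences into an int used as a bit vector."""
    bits = 0
    for i, t in enumerate(transactions):
        if item in t:
            bits |= 1 << i
    return bits

vectors = {item: bitvector(item) for item in "abc"}

def support(itemset):
    """Support counting by intersecting (AND-ing) the item bit vectors."""
    bits = ~0
    for item in itemset:
        bits &= vectors[item]
    # Mask to the transaction count, then popcount the surviving bits.
    return bin(bits & ((1 << len(transactions)) - 1)).count("1")
```

Each AND-plus-popcount replaces a scan over the transaction list, which is why shorter bit vectors per graph node translate directly into faster frequency counting.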


Journal Article
01 Jan 2006-Scopus
TL;DR: In this article, the problem of incorporating partial background knowledge into clustering was formulated as a constrained optimization problem, and two learning algorithms based on hard and fuzzy clustering methods were proposed to solve the problem.
Abstract: Incorporating background knowledge into unsupervised clustering algorithms has been the subject of extensive research in recent years. Nevertheless, existing algorithms implicitly assume that the background information, typically specified in the form of labeled examples or pairwise constraints, has the same feature space as the unlabeled data to be clustered. In this paper, we are concerned with a new problem of incorporating partial background knowledge into clustering, where the labeled examples have moderate overlapping features with the unlabeled data. We formulate this as a constrained optimization problem, and propose two learning algorithms to solve the problem, based on hard and fuzzy clustering methods. An empirical study performed on a variety of real data sets shows that our proposed algorithms improve the quality of clustering results with limited labeled examples.

26 citations
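A label-penalized clustering objective of the general kind described above can be sketched with a k-means-style alternation, where the assignment step adds a penalty for contradicting a known label. This is a generic illustration, not the paper's formulation, which additionally handles labeled examples whose feature space only partially overlaps that of the unlabeled data:

```python
import numpy as np

# Hypothetical data: two Gaussian blobs; a few points carry known labels.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
known = {0: 0, 1: 0, 20: 1, 21: 1}   # point index -> known cluster label
lam, k = 10.0, 2                     # label-violation penalty, cluster count

centers = X[[0, 20]].copy()
for _ in range(20):
    # Assignment step: squared distance plus a penalty for assigning
    # a labeled point to any cluster other than its known one.
    d = np.linalg.norm(X[:, None] - centers[None], axis=2) ** 2
    for i, c in known.items():
        d[i] += lam * (np.arange(k) != c)
    assign = d.argmin(axis=1)
    # Update step: recompute each center as the mean of its cluster.
    for j in range(k):
        if (assign == j).any():
            centers[j] = X[assign == j].mean(axis=0)
```

The penalty term plays the role of the "deviation from known labels" part of a constrained clustering objective: with lam = 0 this reduces to plain k-means, while larger lam forces agreement with the supervision.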


Proceedings Article
01 Jan 2006
TL;DR: It is demonstrated that, with the addition of labeled examples, the anomaly detection algorithm can be guided to develop better models of the normal and abnormal behavior of the data, thus improving the detection rate and reducing the false alarm rate of the algorithm.
Abstract: This paper presents a principled approach for incorporating labeled examples into an anomaly detection task. We demonstrate that, with the addition of labeled examples, the anomaly detection algorithm can be guided to develop better models of the normal and abnormal behavior of the data, thus improving the detection rate and reducing the false alarm rate of the algorithm. A framework based on the finite mixture model is introduced to model the data as well as the constraints imposed by the labeled examples. Empirical studies conducted on real data sets show that significant improvements in detection rate and false alarm rate are achieved using our proposed framework.

23 citations


Proceedings Article
01 Jan 2006
TL;DR: This paper presents a novel clustering algorithm called MIN-CUT that alleviates the effect of bridge-nodes and leads to purer clusters with very little overlap, since clusters that overlap too much become less informative.
Abstract: In this paper, we study the ill-effects of bridge-nodes, which cause many dissimilar objects to be placed together in the same cluster by existing clustering algorithms. We offer two new metrics for measuring how well a clustering algorithm handles the presence of bridge-nodes. We also illustrate how algorithms that produce overlapping clusters help to alleviate the effect of bridge-nodes and form more meaningful clusters. However, if there is too much overlap, the clusters become less informative. To address this problem, we present a novel clustering algorithm called MIN-CUT. Our experimental results with real data sets show that the MIN-CUT algorithm leads to purer clusters that have very little overlap.


Proceedings ArticleDOI
Brian D. Connelly1, C.W. Bowron1, Li Xiao1, Pang-Ning Tan1, Chen Wang1 
14 Aug 2006
TL;DR: This work borrows the technique of association analysis from the data mining community and extends it to intelligently forward queries through the network, enabling unstructured peer-to-peer networks to scale to much larger sizes.
Abstract: Unstructured peer-to-peer networks have become a very popular method for content distribution in the past few years. By not enforcing strict rules on the network's topology or content location, such networks can be created quickly and easily. Unfortunately, because of the unstructured nature of these networks, in order to find content, query messages are flooded to nodes in the network, which results in a large amount of traffic. This work borrows the technique of association analysis from the data mining community and extends it to intelligently forward queries through the network. Because only a small subset of a node's neighbors are forwarded queries, the number of times those queries are propagated is also reduced, which results in considerably less network traffic. These savings enable the networks to scale to much larger sizes, which allows for more content to be shared and more redundancy to be added to the system, as well as allowing more users to take advantage of such networks.
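The forwarding idea can be sketched as keeping per-keyword statistics about which neighbors previously answered queries, then forwarding new queries only to the most strongly associated neighbors and falling back to flooding for unseen keywords. The hit log, neighbor names, and selection rule below are illustrative assumptions, not the paper's exact association-rule mining procedure:

```python
from collections import defaultdict

# Hypothetical hit log: (query keyword, neighbor that returned a result).
hits = [
    ("rock", "n1"), ("rock", "n1"), ("rock", "n2"),
    ("jazz", "n2"), ("jazz", "n3"), ("rock", "n1"),
]

# Count co-occurrences of each keyword with each answering neighbor.
counts = defaultdict(lambda: defaultdict(int))
for kw, n in hits:
    counts[kw][n] += 1

def forward_to(keyword, k=1, neighbors=("n1", "n2", "n3")):
    """Forward to the k neighbors most associated with the keyword;
    fall back to flooding all neighbors for unseen keywords."""
    seen = counts.get(keyword)
    if not seen:
        return list(neighbors)
    return sorted(seen, key=seen.get, reverse=True)[:k]
```

Forwarding to a small associated subset instead of every neighbor is what cuts the per-query message count, at the cost of occasionally missing content held by an unqueried neighbor.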


Proceedings Article
01 Jan 2006
TL;DR: This paper forms this as a constrained optimization problem, and proposes two learning algorithms to solve the problem, based on hard and fuzzy clustering methods, that improve the quality of clustering results with limited labeled examples.
Abstract: Incorporating background knowledge into unsupervised clustering algorithms has been the subject of extensive research in recent years. Nevertheless, existing algorithms implicitly assume that the background information, typically specified in the form of labeled examples or pairwise constraints, has the same feature space as the unlabeled data to be clustered. In this paper, we are concerned with a new problem of incorporating partial background knowledge into clustering, where the labeled examples have moderate overlapping features with the unlabeled data. We formulate this as a constrained optimization problem, and propose two learning algorithms to solve the problem, based on hard and fuzzy clustering methods. An empirical study performed on a variety of real data sets shows that our proposed algorithms improve the quality of clustering results with limited labeled examples.