Showing papers in "Data Mining and Knowledge Discovery in 2004"


Journal ArticleDOI
TL;DR: A novel frequent-pattern tree (FP-tree) structure is proposed, which is an extended prefix-tree structure for storing compressed, crucial information about frequent patterns, and an efficient FP-tree-based mining method, FP-growth, is developed for mining the complete set of frequent patterns by pattern fragment growth.
Abstract: Mining frequent patterns in transaction databases, time-series databases, and many other kinds of databases has been studied popularly in data mining research. Most of the previous studies adopt an Apriori-like candidate set generation-and-test approach. However, candidate set generation is still costly, especially when there exist a large number of patterns and/or long patterns. In this study, we propose a novel frequent-pattern tree (FP-tree) structure, which is an extended prefix-tree structure for storing compressed, crucial information about frequent patterns, and develop an efficient FP-tree-based mining method, FP-growth, for mining the complete set of frequent patterns by pattern fragment growth. Efficiency of mining is achieved with three techniques: (1) a large database is compressed into a condensed, smaller data structure, FP-tree which avoids costly, repeated database scans, (2) our FP-tree-based mining adopts a pattern-fragment growth method to avoid the costly generation of a large number of candidate sets, and (3) a partitioning-based, divide-and-conquer method is used to decompose the mining task into a set of smaller tasks for mining confined patterns in conditional databases, which dramatically reduces the search space. Our performance study shows that the FP-growth method is efficient and scalable for mining both long and short frequent patterns, and is about an order of magnitude faster than the Apriori algorithm and also faster than some recently reported new frequent-pattern mining methods.
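
As a rough illustration of the FP-tree and pattern-fragment growth ideas described above, the following Python sketch builds a frequency-ordered prefix tree in two scans and mines it recursively through conditional pattern bases. It is a minimal reading of the abstract, not the authors' optimized implementation; the transaction list and the min_support value are invented for the example.

```python
from collections import defaultdict, Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fptree(transactions, min_support):
    # First scan: count item frequencies and keep only the frequent items.
    freq = Counter(item for t in transactions for item in set(t))
    freq = {i: c for i, c in freq.items() if c >= min_support}
    root, node_links = FPNode(None, None), defaultdict(list)
    # Second scan: insert each transaction with items in descending-frequency order.
    for t in transactions:
        items = sorted((i for i in set(t) if i in freq),
                       key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                node_links[item].append(child)
            node = node.children[item]
            node.count += 1
    return root, node_links, freq

def fp_growth(transactions, min_support, suffix=()):
    """Mine all frequent itemsets by pattern-fragment growth over conditional databases."""
    _, node_links, freq = build_fptree(transactions, min_support)
    for item in sorted(freq, key=freq.get):               # least frequent first
        pattern = (item,) + suffix
        yield pattern, freq[item]
        # Conditional database: the prefix paths leading to this item.
        cond_db = []
        for node in node_links[item]:
            path, parent = [], node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            cond_db.extend([path] * node.count)
        yield from fp_growth(cond_db, min_support, pattern)

transactions = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c", "d"]]
for itemset, support in fp_growth(transactions, min_support=3):
    print(itemset, support)
```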

2,567 citations


Journal ArticleDOI
TL;DR: An experimental application to network intrusion detection shows that SmartSifter was able to identify data with high scores that corresponded to attacks, with low computational costs.
Abstract: Outlier detection is a fundamental issue in data mining, specifically in fraud detection, network intrusion detection, network monitoring, etc. SmartSifter is an outlier detection engine addressing this problem from the viewpoint of statistical learning theory. This paper provides a theoretical basis for SmartSifter and empirically demonstrates its effectiveness. SmartSifter detects outliers in an on-line process through the on-line unsupervised learning of a probabilistic model (using a finite mixture model) of the information source. Each time a datum is input SmartSifter employs an on-line discounting learning algorithm to learn the probabilistic model. A score is given to the datum based on the learned model with a high score indicating a high possibility of being a statistical outlier. The novel features of SmartSifter are: (1) it is adaptive to non-stationary sources of data; (2) a score has a clear statistical/information-theoretic meaning; (3) it is computationally inexpensive; and (4) it can handle both categorical and continuous variables. An experimental application to network intrusion detection shows that SmartSifter was able to identify data with high scores that corresponded to attacks, with low computational costs. Further experimental application has identified a number of meaningful rare cases in actual health insurance pathology data from Australia's Health Insurance Commission.
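
The on-line discounting and scoring idea can be sketched with a deliberately simplified model. SmartSifter itself learns a finite mixture model and handles categorical variables; the single-Gaussian scorer below, its discount value, and the toy stream are assumptions made only to show how a discounted update yields high scores for statistical outliers.

```python
import math

class DiscountingGaussianScorer:
    """Toy on-line outlier scorer: a single Gaussian updated with a discounting
    (exponential forgetting) factor. SmartSifter uses a richer finite mixture
    model; this only illustrates the scoring idea."""

    def __init__(self, discount=0.05):
        self.r = discount            # forgetting rate: higher = adapts faster
        self.mean, self.var, self.initialized = 0.0, 1.0, False

    def score_and_update(self, x):
        # Score = negative log-likelihood under the *current* model, so data
        # that are unlikely under the learned model receive high scores.
        score = 0.5 * math.log(2 * math.pi * self.var) \
                + (x - self.mean) ** 2 / (2 * self.var)
        if not self.initialized:
            self.mean, self.initialized = x, True
        else:
            # Discounted (on-line) update of mean and variance.
            diff = x - self.mean
            self.mean += self.r * diff
            self.var = (1 - self.r) * self.var + self.r * diff ** 2
        return score

scorer = DiscountingGaussianScorer(discount=0.05)
stream = [1.0, 1.1, 0.9, 1.05, 0.95, 8.0, 1.0, 1.1]   # 8.0 is an injected anomaly
for x in stream:
    print(f"x={x:4.2f}  score={scorer.score_and_update(x):6.2f}")
```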

592 citations


Journal ArticleDOI
TL;DR: A new framework for associations based on the concept of closed frequent itemsets is presented; the number of non-redundant rules produced by the new approach is exponentially smaller than the rule set from the traditional approach.
Abstract: The traditional association rule mining framework produces many redundant rules. The extent of redundancy is a lot larger than previously suspected. We present a new framework for associations based on the concept of closed frequent itemsets. The number of non-redundant rules produced by the new approach is exponentially (in the length of the longest frequent itemset) smaller than the rule set from the traditional approach. Experiments using several “hard” as well as “easy” real and synthetic databases confirm the utility of our framework in terms of reduction in the number of rules presented to the user, and in terms of time.
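
The notion of a closed frequent itemset that this framework rests on can be made concrete with a small brute-force sketch: the closure of an itemset is the intersection of all transactions containing it, and an itemset is closed when it equals its own closure. The transactions and support threshold below are invented, and the naive enumeration stands in for the paper's mining algorithm.

```python
from itertools import combinations

transactions = [frozenset(t) for t in
                [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "c"}, {"b", "c"}]]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

def closure(itemset):
    """Intersection of all transactions containing the itemset: the largest
    superset with exactly the same support."""
    covering = [t for t in transactions if itemset <= t]
    return frozenset.intersection(*covering) if covering else frozenset()

# Enumerate frequent itemsets naively and keep only the closed ones.
items = sorted({i for t in transactions for i in t})
min_support = 2
closed = set()
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        itemset = frozenset(combo)
        if support(itemset) >= min_support and closure(itemset) == itemset:
            closed.add(itemset)

for c in sorted(closed, key=sorted):
    print(sorted(c), support(c))
```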

421 citations


Journal ArticleDOI
TL;DR: An approach to induce high-precision maps from traces of vehicles equipped with differential GPS receivers is presented; the new contributions include a spatial clustering algorithm for inferring the connectivity structure, more powerful lane-finding algorithms that can handle lane splits and merges, and an approach to inferring detailed intersection models.
Abstract: Despite the increasing popularity of route guidance systems, current digital maps are still inadequate for many advanced applications in automotive safety and convenience. Among the drawbacks are the insufficient accuracy of road geometry and the lack of fine-grained information, such as lane positions and intersection structure. In this paper, we present an approach to induce high-precision maps from traces of vehicles equipped with differential GPS receivers. Since the cost of these systems is rapidly decreasing and wireless technology is advancing to provide the communication infrastructure, we expect that in the next few years large amounts of car data will be available inexpensively. Our approach consists of successive processing steps: individual vehicle trajectories are divided into road segments and intersections; a road centerline is derived for each segment; lane positions are determined by clustering the perpendicular offsets from it; and the transitions of traces between segments are utilized in the generation of intersection models. This paper describes an approach to this complex data-mining task in a contiguous manner. Among the new contributions are a spatial clustering algorithm for inferring the connectivity structure, more powerful lane finding algorithms that are able to handle lane splits and merges, and an approach to inferring detailed intersection models.

260 citations


Journal ArticleDOI
TL;DR: A class of methods is introduced that begin by using a single database pass to perform a partial computation of the totals required, storing these in the form of a set enumeration tree, which is created in time linear to the size of the database.
Abstract: A well-known approach to Knowledge Discovery in Databases involves the identification of association rules linking database attributes. Extracting all possible association rules from a database, however, is a computationally intractable problem, because of the combinatorial explosion in the number of sets of attributes for which incidence-counts must be computed. Existing methods for dealing with this may involve multiple passes of the database, and tend still to cope badly with densely-packed database records. We describe here a class of methods we have introduced that begin by using a single database pass to perform a partial computation of the totals required, storing these in the form of a set enumeration tree, which is created in time linear to the size of the database. Algorithms for using this structure to complete the count summations are discussed, and a method is described, derived from the well-known Apriori algorithm. Results are presented demonstrating the performance advantage to be gained from the use of this approach. Finally, we discuss possible further applications of the method.
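
A minimal sketch of the single-pass idea described above: each (ordered) transaction is stored once in a set-enumeration tree with a partial count, and total supports are completed afterwards by summing over the stored supersets. The simple tree walk used for the summation and the toy database are assumptions; they stand in for the paper's more efficient summation algorithms.

```python
class SETreeNode:
    def __init__(self):
        self.count = 0          # partial support: transactions stored at this node
        self.children = {}

def build_partial_tree(transactions):
    """Single database pass: each (lexicographically ordered) transaction is
    stored once in a set-enumeration tree, incrementing a partial count."""
    root = SETreeNode()
    for t in transactions:
        node = root
        for item in sorted(set(t)):
            node = node.children.setdefault(item, SETreeNode())
        node.count += 1
    return root

def total_support(root, itemset):
    """Complete the summation: the support of `itemset` is the sum of partial
    counts over all stored sets that contain it (a plain tree walk here)."""
    itemset = frozenset(itemset)
    total = 0
    stack = [(root, frozenset())]
    while stack:
        node, path = stack.pop()
        if itemset <= path:
            total += node.count
        for item, child in node.children.items():
            stack.append((child, path | {item}))
    return total

db = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["a", "b"], ["b", "c"]]
tree = build_partial_tree(db)
print(total_support(tree, {"a", "b"}))   # -> 3
```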

134 citations


Journal ArticleDOI
TL;DR: A notion of convertible constraints is developed; this class is systematically analyzed, classified, and characterized, and techniques are developed that enable such constraints to be pushed deep inside the recently developed FP-growth algorithm for frequent itemset mining.
Abstract: Recent work has highlighted the importance of the constraint-based mining paradigm in the context of frequent itemsets, associations, correlations, sequential patterns, and many other interesting patterns in large databases. Constraint pushing techniques have been developed for mining frequent patterns and associations with antimonotonic, monotonic, and succinct constraints. In this paper, we study constraints which cannot be handled with existing theory and techniques in frequent pattern mining. For example, avg(S) θ v, median(S) θ v, sum(S) θ v (S can contain items of arbitrary values, θ ∈ {>, <, ≤, ≥} and v is a real number) are customarily regarded as “tough” constraints in that they cannot be pushed inside an algorithm such as Apriori. We develop a notion of convertible constraints and systematically analyze, classify, and characterize this class. We also develop techniques which enable them to be readily pushed deep inside the recently developed FP-growth algorithm for frequent itemset mining. Results from our detailed experiments show the effectiveness of the techniques developed.
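
The following toy sketch shows why a constraint such as avg(S) ≥ v is "convertible": enumerated over a value-descending item order, a prefix that violates the constraint cannot be repaired by any extension, so the whole subtree can be pruned. The item values and threshold are invented, and the plain set-enumeration search stands in for the FP-growth integration described in the paper.

```python
# Toy illustration of a convertible constraint: avg(S) >= v.
# Under an arbitrary enumeration order the constraint is neither antimonotonic
# nor monotonic, but if items are enumerated in value-descending order the
# average of a prefix can only stay equal or decrease as it grows, so any
# prefix violating avg(prefix) >= v can be pruned with all of its extensions.

item_value = {"a": 90, "b": 60, "c": 40, "d": 10}   # made-up item profits
v = 50                                               # constraint: avg(S) >= 50

def avg(itemset):
    return sum(item_value[i] for i in itemset) / len(itemset)

def enumerate_with_pruning(prefix, remaining, results):
    for idx, item in enumerate(remaining):
        candidate = prefix + [item]
        if avg(candidate) < v:
            # Convertible antimonotone: every extension of this prefix (which
            # can only append items of equal or smaller value) also fails.
            continue
        results.append(candidate)
        enumerate_with_pruning(candidate, remaining[idx + 1:], results)
    return results

# Sorting items in value-descending order is what makes the constraint prunable.
order = sorted(item_value, key=item_value.get, reverse=True)
print(enumerate_with_pruning([], order, []))
```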

130 citations


Journal ArticleDOI
TL;DR: An extension of the learning rules in a Principal Component Analysis network, derived to be optimal for a specific probability density function, is reviewed; rather than viewing a single member of this family of learning rules as an extension of PCA, the whole family is viewed as methods of performing Exploratory Projection Pursuit.
Abstract: In this paper, we review an extension of the learning rules in a Principal Component Analysis network which has been derived to be optimal for a specific probability density function. We note that this probability density function is one of a family of pdfs and investigate the learning rules formed in order to be optimal for several members of this family. We show that, whereas we have previously (Lai et al., 2000; Fyfe and MacDonald, 2002) viewed the single member of the family as an extension of PCA, it is more appropriate to view the whole family of learning rules as methods of performing Exploratory Projection Pursuit. We illustrate this on both artificial and real data sets.

106 citations


Journal ArticleDOI
TL;DR: This study integrates the discovery of frequent itemsets with a (microeconomic) model for product selection (PROFSET) that enables the integration of both quantitative and qualitative criteria and demonstrates that the impact of product assortment decisions on overall assortment profitability can easily be evaluated by means of sensitivity analysis.
Abstract: It has been claimed that the discovery of association rules is well suited for applications of market basket analysis to reveal regularities in the purchase behaviour of customers. However, one disadvantage of association discovery today is that there is no provision for taking into account the business value of an association. Therefore, recent work indicates that the discovery of interesting rules can in fact best be addressed within a microeconomic framework. This study integrates the discovery of frequent itemsets with a (microeconomic) model for product selection (PROFSET). The model enables the integration of both quantitative and qualitative (domain knowledge) criteria. Sales transaction data from a fully automated convenience store are used to demonstrate the effectiveness of the model against a heuristic for product selection based on product-specific profitability. We show that with the use of frequent itemsets we are able to identify the cross-sales potential of product items and use this information for better product selection. Furthermore, we demonstrate that the impact of product assortment decisions on overall assortment profitability can easily be evaluated by means of sensitivity analysis.
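
As a rough sketch of how frequent itemsets can drive product selection, the greedy routine below repeatedly adds the frequent itemset with the best margin per newly stocked product until an assortment size limit is reached. The paper formulates PROFSET as a microeconomic optimization model rather than this greedy stand-in, and the itemsets, margins, and size limit here are invented.

```python
def greedy_profset(frequent_sets, margins, max_products):
    """Greedy stand-in for the PROFSET idea: repeatedly add the frequent itemset
    with the best margin per newly stocked product, so that cross-sales potential
    (itemsets rather than individual products) drives the assortment decision."""
    assortment, profit = set(), 0.0
    remaining = set(frequent_sets)
    while remaining:
        def value(s):
            new = len(set(s) - assortment)
            return margins[s] / new if new else float("inf")
        best = max(remaining, key=value)
        remaining.discard(best)
        if len(assortment | set(best)) > max_products:
            continue                        # would exceed the shelf-space limit
        assortment |= set(best)
        profit += margins[best]             # margin counts only if all items are stocked
    return assortment, profit

# Invented frequent itemsets with made-up margins attributable to them.
frequent_sets = [("beer", "chips"), ("beer", "nuts"), ("soda", "chips"), ("milk",)]
margins = {("beer", "chips"): 12.0, ("beer", "nuts"): 5.0,
           ("soda", "chips"): 7.0, ("milk",): 3.0}
print(greedy_profset(frequent_sets, margins, max_products=3))
```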

84 citations


Journal ArticleDOI
TL;DR: This work proposes an algorithm that remains very efficient, generally applicable, and multidimensional, but is more robust to noise and outliers because it uses medians rather than means as estimators for the centers of clusters.
Abstract: General purpose and highly applicable clustering methods are usually required during the early stages of knowledge discovery exercises. k-MEANS has been adopted as the prototype of iterative model-based clustering because of its speed, simplicity and capability to work within the format of very large databases. However, k-MEANS has several disadvantages derived from its statistical simplicity. We propose an algorithm that remains very efficient, generally applicable, and multidimensional, but is more robust to noise and outliers. We achieve this by using medians rather than means as estimators for the centers of clusters. Comparison with k-MEANS, EXPECTATION MAXIMIZATION, and sampling demonstrates the advantages of our algorithm.
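
The medians-rather-than-means idea can be sketched in a few lines: a Lloyd-style iteration in which each cluster center is the coordinate-wise median of its members, paired with the L1 distance. Initialization, convergence checks, and the data are simplified or invented; this is a minimal sketch, not the authors' exact algorithm.

```python
import random
from statistics import median

def k_medians(points, k, iters=20, seed=0):
    """Lloyd-style iteration, but cluster centers are coordinate-wise medians,
    which makes the centers far less sensitive to outliers than means."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Manhattan (L1) distance pairs naturally with the median.
            nearest = min(range(k),
                          key=lambda j: sum(abs(a - b) for a, b in zip(p, centers[j])))
            clusters[nearest].append(p)
        centers = [tuple(median(c) for c in zip(*cluster)) if cluster else centers[j]
                   for j, cluster in enumerate(clusters)]
    return centers, clusters

# Two compact groups plus one gross outlier that would drag a mean-based center.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9), (100, 100)]
centers, clusters = k_medians(points, k=2)
print(centers)
```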

81 citations


Journal ArticleDOI
TL;DR: The key factors that influence the performance of the pattern growth approach are identified, and the combination of the top-down traversal strategy and the ascending frequency order achieves significant performance improvement over previous works.
Abstract: Mining frequent patterns, including mining frequent closed patterns or maximal patterns, is a fundamental and important problem in the data mining area. Many algorithms adopt the pattern growth approach, which is shown to be superior to the candidate generate-and-test approach, especially when long patterns exist in the datasets. In this paper, we identify the key factors that influence the performance of the pattern growth approach, and optimize them to further improve the performance. Our algorithm uses a simple yet compact data structure, the ascending frequency ordered prefix-tree (AFOPT), to store the conditional databases, in which we use arrays to store single branches to further save space. The AFOPT structure is traversed in top-down depth-first order. Our analysis and experimental results show that the combination of the top-down traversal strategy and the ascending frequency order achieves significant performance improvement over previous works.

72 citations


Journal ArticleDOI
TL;DR: A comparative study on different kinds of sequential association rules for web document prediction shows that the existing approaches can be cast under two important dimensions, namely the type of antecedents of rules and the criterion for selecting prediction rules.
Abstract: Web servers keep track of web users' browsing behavior in web logs. From these logs, one can build statistical models that predict the users' next requests based on their current behavior. These data are complex due to their large size and sequential nature. In the past, researchers have proposed different methods for building association-rule based prediction models using the web logs, but there has been no systematic study on the relative merits of these methods. In this paper, we provide a comparative study on different kinds of sequential association rules for web document prediction. We show that the existing approaches can be cast under two important dimensions, namely the type of antecedents of rules and the criterion for selecting prediction rules. From this comparison we propose a best overall method and empirically test the proposed model on real web logs.
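
One point in the design space the paper compares can be sketched directly: rules whose antecedent is the last k pages of a session, with the prediction chosen by confidence. The sessions, the value of k, and the count threshold below are invented, and the sketch does not reproduce the paper's full comparison of antecedent types and selection criteria.

```python
from collections import defaultdict

def mine_last_k_rules(sessions, k=2, min_count=2):
    """Count how often each length-k suffix of the browsing history is followed
    by a given next page, then turn the counts into confidence-ranked rules."""
    rule_counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        for i in range(k, len(session)):
            antecedent = tuple(session[i - k:i])
            rule_counts[antecedent][session[i]] += 1
    rules = {}
    for antecedent, nexts in rule_counts.items():
        total = sum(nexts.values())
        page, count = max(nexts.items(), key=lambda kv: kv[1])
        if count >= min_count:
            rules[antecedent] = (page, count / total)   # (prediction, confidence)
    return rules

def predict(rules, recent_pages, k=2):
    return rules.get(tuple(recent_pages[-k:]))

sessions = [
    ["home", "news", "sports", "scores"],
    ["home", "news", "sports", "scores"],
    ["home", "news", "weather"],
    ["search", "news", "sports", "scores"],
]
rules = mine_last_k_rules(sessions, k=2)
print(predict(rules, ["home", "news", "sports"], k=2))   # -> ('scores', 1.0)
```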

Journal ArticleDOI
TL;DR: This paper presents a set of new algorithms that solve the Distributed Association Rule Mining problem using far less communication; the new algorithms are also extremely robust.
Abstract: Mining for associations between items in large transactional databases is a central problem in the field of knowledge discovery. When the database is partitioned among several share-nothing machines, the problem can be addressed using distributed data mining algorithms. One such algorithm, called CD, was proposed by Agrawal and Shafer and was later enhanced by the FDM algorithm of Cheung, Han et al. The main problem with these algorithms is that they do not scale well with the number of partitions. They are thus impractical for use in modern distributed environments such as peer-to-peer systems, in which hundreds or thousands of computers may interact. In this paper we present a set of new algorithms that solve the Distributed Association Rule Mining problem using far less communication. In addition to being very efficient, the new algorithms are also extremely robust. Unlike existing algorithms, they continue to be efficient even when the data is skewed or the partition sizes are imbalanced. We present both experimental and theoretical results concerning the behavior of these algorithms and explain how they can be implemented in different settings.
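
For orientation, the baseline count-distribution style of distributed mining mentioned in the abstract (each share-nothing node counts candidates over its own partition and the global support is the sum of the exchanged counts) can be sketched as follows. This illustrates only the baseline the paper improves upon, not its new communication-efficient algorithms; the partitions and threshold are invented.

```python
from collections import Counter
from itertools import combinations

def local_counts(partition, k):
    """Each share-nothing node counts candidate k-itemsets over its own partition."""
    counts = Counter()
    for transaction in partition:
        for itemset in combinations(sorted(set(transaction)), k):
            counts[itemset] += 1
    return counts

def count_distribution_round(partitions, k, min_support):
    """Baseline CD-style round: every node shares its local counts and the global
    support is the sum; the paper's algorithms aim to cut this communication."""
    global_counts = Counter()
    for partition in partitions:           # stands in for messages between nodes
        global_counts.update(local_counts(partition, k))
    return {itemset: c for itemset, c in global_counts.items() if c >= min_support}

partitions = [
    [["a", "b", "c"], ["a", "b"]],         # node 1's share of the database
    [["a", "c"], ["a", "b", "c"]],         # node 2's share
    [["b", "c"], ["a", "b"]],              # node 3's share
]
print(count_distribution_round(partitions, k=2, min_support=3))
```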

Journal ArticleDOI
TL;DR: An empirical comparison between the methods used in e-VZpro and other collaborative filtering methods including dependency networks, item-based, and association mining is provided in this paper.
Abstract: Commercial recommender systems use various data mining techniques to make appropriate recommendations to users during online, real-time sessions. Published algorithms focus more on the discrete user ratings instead of binary results, which hampers their predictive capabilities when usage data is sparse. The system proposed in this paper, e-VZpro, is an association mining-based recommender tool designed to overcome these problems through a two-phase approach. In the first phase, batches of customer historical data are analyzed through association mining in order to determine the association rules for the second phase. During the second phase, a scoring algorithm is used to rank the recommendations online for the customer. The second phase differs from the traditional approach and an empirical comparison between the methods used in e-VZpro and other collaborative filtering methods including dependency networks, item-based, and association mining is provided in this paper. This comparison evaluates the algorithms used in each of the above methods using two internal customer datasets and a benchmark dataset. The results of this comparison clearly show that e-VZpro performs well compared to dependency networks and association mining. In general, item-based algorithms with cosine similarity measures have the best performance.

Journal ArticleDOI
TL;DR: A technique is presented that dynamically (i.e., during the search) prunes partitions of prefixes of the sorted data from the search space of the algorithm; it works for all convex and cumulative evaluation functions.
Abstract: We consider multisplitting of numerical value ranges, a task that is encountered as a discretization step preceding induction and also embedded into learning algorithms. We are interested in finding the partition that optimizes the value of a given attribute evaluation function. For most commonly used evaluation functions this task takes quadratic time in the number of potential cut points in the numerical range. Hence, it is a potential bottleneck in data mining algorithms. We present two techniques that speed up the optimal multisplitting task. The first one aims at discarding cut point candidates in a quick linear-time preprocessing scan before embarking on the actual search. We generalize the definition of boundary points by Fayyad and Irani to allow us to merge adjacent example blocks that have the same relative class distribution. We prove for several commonly used evaluation functions that this processing removes only suboptimal cut points. Hence, the algorithm does not lose optimality. Our second technique tackles the quadratic-time dynamic programming algorithm, which is the best schema for optimizing many well-known evaluation functions. We present a technique that dynamically—i.e., during the search—prunes partitions of prefixes of the sorted data from the search space of the algorithm. The method works for all convex and cumulative evaluation functions. Together the use of these two techniques speeds up the multisplitting process considerably. Compared to the baseline dynamic programming algorithm the speed-up is around 50 percent on the average and up to 90 percent in some cases. We conclude that optimal multisplitting is fully feasible on all benchmark data sets we have encountered.
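
The quadratic-time dynamic programming schema that the paper's pruning techniques accelerate can be sketched as follows, using weighted class entropy as an example of a cumulative evaluation function. The pruning of boundary points and of DP prefixes is deliberately omitted, and the data and number of intervals are invented.

```python
import math
from collections import Counter

def interval_impurity(labels):
    """Class-entropy impurity of one interval, weighted by its size
    (one example of a cumulative evaluation function)."""
    n = len(labels)
    counts = Counter(labels)
    return -n * sum((c / n) * math.log2(c / n) for c in counts.values())

def optimal_multisplit(values, labels, k):
    """DP over candidate cut points: best[i][j] is the lowest impurity of splitting
    the first i (sorted) examples into j intervals. With prefix class counts the
    search is quadratic in the number of cut points, as in the paper; recomputing
    interval impurities here keeps the sketch short."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    vals = [values[i] for i in order]
    labs = [labels[i] for i in order]
    n, INF = len(vals), float("inf")
    best = [[INF] * (k + 1) for _ in range(n + 1)]
    cut = [[None] * (k + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for j in range(1, k + 1):
        for i in range(1, n + 1):
            for s in range(j - 1, i):            # last interval covers examples s..i-1
                cand = best[s][j - 1] + interval_impurity(labs[s:i])
                if cand < best[i][j]:
                    best[i][j], cut[i][j] = cand, s
    # Recover the cut positions as thresholds between consecutive values.
    bounds, i, j = [], n, k
    while j > 1:
        s = cut[i][j]
        bounds.append((vals[s - 1] + vals[s]) / 2)
        i, j = s, j - 1
    return best[n][k], sorted(bounds)

values = [1, 2, 3, 10, 11, 12, 20, 21, 22]
labels = ["x", "x", "x", "y", "y", "y", "x", "x", "x"]
print(optimal_multisplit(values, labels, k=3))   # impurity 0.0, cuts near 6.5 and 16
```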

Journal ArticleDOI
TL;DR: It is formally proved that the proposed algorithm for subsequence matching that supports normalization transform in time-series databases does not cause false dismissal and is suitable for practical situations, where the queries with smaller selectivities are much more frequent.
Abstract: In this paper, an algorithm is proposed for subsequence matching that supports normalization transform in time-series databases. Normalization transform enables finding sequences with similar fluctuation patterns even though they are not close to each other before the normalization transform. Simple application of existing subsequence matching algorithms to support normalization transform is not feasible since the algorithms do not have information for normalization transform of subsequences of arbitrary lengths. Application of the existing whole matching algorithm supporting normalization transform to the subsequence matching is feasible, but requires an index for every possible length of the query sequence, causing serious overhead on both storage space and update time. The proposed algorithm generates indexes only for a small number of different lengths of query sequences. For subsequence matching it selects the most appropriate index among them. Better search performance can be obtained by using more indexes. In this paper, the approach is called index interpolation. It is formally proved that the proposed algorithm does not cause false dismissal. The search performance can be traded off with storage space by adjusting the number of indexes. For performance evaluation, a series of experiments is conducted using the indexes for only five different lengths out of lengths 256 to 512 of the query sequence. The results show that the proposed algorithm outperforms the sequential scan by up to 24 times on the average when the selectivity of the query is 10^-2 and up to 146 times when it is 10^-5. Since the proposed algorithm performs better with smaller selectivities, it is suitable for practical situations, where the queries with smaller selectivities are much more frequent.
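
What "subsequence matching under normalization transform" asks for can be illustrated with a brute-force scan in which each query-length window and the query are z-normalized before the distance test. The paper replaces exactly this kind of scan with index interpolation; the sequences and tolerance below are invented.

```python
import math

def z_normalize(seq):
    """Normalization transform: zero mean, unit standard deviation, so that
    sequences with similar fluctuation *shapes* become comparable."""
    n = len(seq)
    mean = sum(seq) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in seq) / n) or 1.0
    return [(x - mean) / std for x in seq]

def normalized_subsequence_matches(data, query, epsilon):
    """Brute-force scan: slide a query-length window over the data sequence and
    report windows whose z-normalized Euclidean distance to the z-normalized
    query is within epsilon."""
    m = len(query)
    q = z_normalize(query)
    hits = []
    for start in range(len(data) - m + 1):
        w = z_normalize(data[start:start + m])
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(w, q)))
        if dist <= epsilon:
            hits.append((start, dist))
    return hits

data = [10, 11, 13, 12, 50, 52, 56, 54, 20, 21, 23, 22]
query = [1, 2, 4, 3]          # same shape as [10,11,13,12] and [50,52,56,54]
print(normalized_subsequence_matches(data, query, epsilon=0.5))
```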

Journal ArticleDOI
TL;DR: This work proposes a novel hypergraph model to represent the relations among the patterns, and introduces a vertex-to-cluster affinity concept to enable the use of existing metrics in the second phase.
Abstract: In traditional approaches for clustering market basket type data, relations among transactions are modeled according to the items occurring in these transactions. However, an individual item might induce different relations in different contexts. Since such contexts might be captured by interesting patterns in the overall data, we represent each transaction as a set of patterns through modifying the conventional pattern semantics. By clustering the patterns in the dataset, we infer a clustering of the transactions represented this way. For this, we propose a novel hypergraph model to represent the relations among the patterns. Instead of a local measure that depends only on common items among patterns, we propose a global measure that is based on the co-occurrences of these patterns in the overall data. The success of existing hypergraph partitioning based algorithms in other domains depends on sparsity of the hypergraph and explicit objective metrics. For this, we propose a two-phase clustering approach for the above hypergraph, which is expected to be dense. In the first phase, the vertices of the hypergraph are merged in a multilevel algorithm to obtain a large number of high-quality clusters. Here, we propose new quality metrics for merging decisions in hypergraph clustering specifically for this domain. In order to enable the use of existing metrics in the second phase, we introduce a vertex-to-cluster affinity concept to devise a method for constructing a sparse hypergraph based on the obtained clustering. The experiments we have performed show the effectiveness of the proposed framework.

Journal ArticleDOI
TL;DR: The concept of information gain is proposed to measure the overall degree of surprise of the pattern within a data sequence and the bounded information gain property is identified to tackle the predicament caused by the violation of the downward closure property by the information gain measure.
Abstract: In this paper, we focus on mining surprising periodic patterns in a sequence of events. In many applications, e.g., computational biology, an infrequent pattern is still considered very significant if its actual occurrence frequency exceeds the prior expectation by a large margin. The traditional metric, such as support, is not necessarily the ideal model to measure this kind of surprising patterns because it treats all patterns equally in the sense that every occurrence carries the same weight towards the assessment of the significance of a pattern regardless of the probability of occurrence. A more suitable measurement, information, is introduced to naturally value the degree of surprise of each occurrence of a pattern as a continuous and monotonically decreasing function of its probability of occurrence. This would allow patterns with vastly different occurrence probabilities to be handled seamlessly. As the accumulated degree of surprise of all repetitions of a pattern, the concept of information gain is proposed to measure the overall degree of surprise of the pattern within a data sequence. The bounded information gain property is identified to tackle the predicament caused by the violation of the downward closure property by the information gain measure and in turn provides an efficient solution to this problem. Furthermore, the user has a choice between specifying a minimum information gain threshold and choosing the number of surprising patterns wanted. Empirical tests demonstrate the efficiency and the usefulness of the proposed model.
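
The information-gain measure can be illustrated with a small sketch: each occurrence of a pattern contributes an information value of -log2 of its occurrence probability, and the information gain accumulates that value over the pattern's repetitions in the sequence. The independence-based probability model, the period-aligned occurrence counting, and the event sequence are simplifying assumptions for the example.

```python
import math
from collections import Counter

def pattern_probability(pattern, symbol_prob):
    """Probability of one occurrence of the pattern under an independence model;
    '*' is a don't-care position that matches any event."""
    p = 1.0
    for symbol in pattern:
        if symbol != "*":
            p *= symbol_prob[symbol]
    return p

def information_gain(sequence, pattern):
    """Information of one occurrence is -log2(probability); the information gain
    of the pattern is that value accumulated over its period-aligned occurrences."""
    counts = Counter(sequence)
    symbol_prob = {s: c / len(sequence) for s, c in counts.items()}
    period = len(pattern)
    occurrences = 0
    for start in range(0, len(sequence) - period + 1, period):
        window = sequence[start:start + period]
        if all(p == "*" or p == w for p, w in zip(pattern, window)):
            occurrences += 1
    info_per_occurrence = -math.log2(pattern_probability(pattern, symbol_prob))
    return occurrences, occurrences * info_per_occurrence

# The pattern (a, f, *, *) recurs in every period, so each occurrence's surprise
# (-log2 of its probability) accumulates into a large information gain.
sequence = list("afbcafbdafbcafbe")
print(information_gain(sequence, ["a", "f", "*", "*"]))
```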

Journal ArticleDOI
TL;DR: A new method of outlier detection and data cleaning for both normal and non-normal multivariate data sets is proposed; it is based on an iterated local fit without a priori metric assumptions and provides good results with large data sets.
Abstract: A new method of outlier detection and data cleaning for both normal and non-normal multivariate data sets is proposed. It is based on an iterated local fit without a priori metric assumptions. We propose a new approach supported by finite mixture clustering which provides good results with large data sets. A multi-step structure, consisting of three phases, is developed. The importance of outlier detection in industrial modeling for open-loop control prediction is also described. The described algorithm gives good results both in simulation runs with artificial data sets and with experimental data sets recorded in a rubber factory. Finally, some discussion of this methodology is presented.

Journal ArticleDOI
TL;DR: This paper studies how to incrementally modify and maintain the concise boundary descriptions of the space of all emerging patterns when small changes occur to the data, and introduces algorithms to handle four types of changes.
Abstract: Emerging patterns (EPs) are useful knowledge patterns with many applications. In recent studies on bio-medical profiling data, we have successfully used such patterns to solve difficult cancer diagnosis problems and produced higher classification accuracy when compared to alternative methods. However, the discovery of EPs is a challenging and computationally expensive problem. In this paper, we study how to incrementally modify and maintain the concise boundary descriptions of the space of all emerging patterns when small changes occur to the data. As EP spaces are convex, the maintenance on the bounds guarantees that no desired patterns are lost. We introduce algorithms to handle four types of changes: insertion of new data, deletion of old data, addition of new attributes, and deletion of old attributes. We compare these incremental algorithms, on six benchmark data sets, against an efficient algorithm that computes from scratch. The results show that the incremental algorithms are much faster than the From-Scratch method, often with tremendous speed-up rates.

Journal ArticleDOI
TL;DR: Two selection techniques are proposed for the selection of inputs for classification models based on ratios of measured quantities: one based on a pre-selection procedure and another based on a genetic algorithm.
Abstract: This paper is concerned with the selection of inputs for classification models based on ratios of measured quantities. For this purpose, all possible ratios are built from the quantities involved and variable selection techniques are used to choose a convenient subset of ratios. In this context, two selection techniques are proposed: one based on a pre-selection procedure and another based on a genetic algorithm. In an example involving the financial distress prediction of companies, the models obtained from ratios selected by the proposed techniques compare favorably to a model using ratios usually found in the financial distress literature.
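
The first steps of the approach can be sketched directly: build every pairwise ratio of the measured quantities and pre-select ratios with a simple univariate criterion. The correlation-based criterion, the quantities, and the labels below are invented stand-ins; the paper's genetic-algorithm selection and financial data are not reproduced.

```python
from itertools import permutations
from statistics import mean, pstdev

def build_ratios(rows, quantities):
    """Form every ordered pairwise ratio of the measured quantities."""
    ratio_rows = []
    for row in rows:
        features = {}
        for a, b in permutations(quantities, 2):
            if row[b] != 0:
                features[f"{a}/{b}"] = row[a] / row[b]
        ratio_rows.append(features)
    return ratio_rows

def preselect(ratio_rows, labels, top_k=2):
    """Toy pre-selection: rank ratios by absolute correlation with the label."""
    names = ratio_rows[0].keys()
    def abs_corr(name):
        xs = [r[name] for r in ratio_rows]
        mx, my = mean(xs), mean(labels)
        sx, sy = pstdev(xs), pstdev(labels)
        if sx == 0 or sy == 0:
            return 0.0
        cov = mean((x - mx) * (y - my) for x, y in zip(xs, labels))
        return abs(cov / (sx * sy))
    return sorted(names, key=abs_corr, reverse=True)[:top_k]

rows = [{"debt": 5, "assets": 10, "income": 2},
        {"debt": 9, "assets": 10, "income": 1},
        {"debt": 2, "assets": 10, "income": 3},
        {"debt": 8, "assets": 10, "income": 1}]
labels = [0, 1, 0, 1]          # 1 = distressed (made-up)
print(preselect(build_ratios(rows, ["debt", "assets", "income"]), labels))
```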

Journal ArticleDOI
Charu C. Aggarwal
TL;DR: This paper mines the significant browsing patterns of world wide web users in order to model the likelihood of web pages belonging to a specified predicate, and refers to this technique as collaborative crawling because it mines the collective user experiences in order to find topical resources.
Abstract: In recent years, there has been considerable research on constructing crawlers which find resources satisfying specific conditions called predicates. Such a predicate could be a keyword query, a topical query, or some arbitrary constraint on the internal structure of the web page. Several techniques such as focused crawling and intelligent crawling have recently been proposed for performing the topic specific resource discovery process. All these crawlers are linkage based, since they use the hyperlink behavior in order to perform resource discovery. Recent studies have shown that the topical correlations in hyperlinks are quite noisy and may not always show the consistency necessary for a reliable resource discovery process. In this paper, we will approach the problem of resource discovery from an entirely different perspective; we will mine the significant browsing patterns of world wide web users in order to model the likelihood of web pages belonging to a specified predicate. This user behavior can be mined from the freely available traces of large public domain proxies on the world wide web. For example, proxy caches such as Squid are hierarchical proxies which make their logs publicly available. As we shall see in this paper, such traces are a rich source of information which can be mined in order to find the users that are most relevant to the topic of a given crawl. We refer to this technique as collaborative crawling because it mines the collective user experiences in order to find topical resources. Such a strategy turns out to be extremely effective because the topical consistency in world wide web browsing patterns turns out to be very high compared to the noisy linkage information. In addition, the user-centered crawling system can be combined with linkage based systems to create an overall system which works more effectively than a system based purely on either user behavior or hyperlinks.

Journal ArticleDOI
TL;DR: A new, abstract method derived from convex set theory is presented that can be used to determine the properties of an unknown, complex data set and to assist in finding the most appropriate recognition algorithm.
Abstract: In this paper a new, abstract method for analysis and visualization of multidimensional data sets in pattern recognition problems is introduced. It can be used to determine the properties of an unknown, complex data set and to assist in finding the most appropriate recognition algorithm. Additionally, it can be employed to design layers of a feedforward artificial neural network or to visualize the higher-dimensional problems in 2-D and 3-D without losing relevant data set information. The method is derived from the convex set theory and works by considering convex subsets within the data and analyzing their respective positions in the original dimension. Its ability to describe certain set features that cannot be explicitly projected into lower dimensions sets it apart from many other visualization techniques. Two classical multidimensional problems are analyzed and the results show the usefulness of the presented method and underline its strengths and weaknesses.