Showing papers in "Data Mining and Knowledge Discovery in 2004"


Journal ArticleDOI
TL;DR: A novel frequent-pattern tree (FP-tree) structure is proposed, which is an extended prefix-tree structure for storing compressed, crucial information about frequent patterns, and an efficient FP-tree-based mining method, FP-growth, is developed for mining the complete set of frequent patterns by pattern fragment growth.
Abstract: Mining frequent patterns in transaction databases, time-series databases, and many other kinds of databases has been studied popularly in data mining research. Most of the previous studies adopt an Apriori-like candidate set generation-and-test approach. However, candidate set generation is still costly, especially when there exist a large number of patterns and/or long patterns. In this study, we propose a novel frequent-pattern tree (FP-tree) structure, which is an extended prefix-tree structure for storing compressed, crucial information about frequent patterns, and develop an efficient FP-tree-based mining method, FP-growth, for mining the complete set of frequent patterns by pattern fragment growth. Efficiency of mining is achieved with three techniques: (1) a large database is compressed into a condensed, smaller data structure, FP-tree which avoids costly, repeated database scans, (2) our FP-tree-based mining adopts a pattern-fragment growth method to avoid the costly generation of a large number of candidate sets, and (3) a partitioning-based, divide-and-conquer method is used to decompose the mining task into a set of smaller tasks for mining confined patterns in conditional databases, which dramatically reduces the search space. Our performance study shows that the FP-growth method is efficient and scalable for mining both long and short frequent patterns, and is about an order of magnitude faster than the Apriori algorithm and also faster than some recently reported new frequent-pattern mining methods.
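
As a rough illustration of the FP-tree and pattern-fragment growth ideas described above, the following Python sketch builds a frequency-ordered prefix tree in two scans and mines it recursively through conditional pattern bases. It is a minimal reading of the abstract, not the authors' optimized implementation; the transaction list and the min_support value are invented for the example.

```python
from collections import defaultdict, Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fptree(transactions, min_support):
    # First scan: count item frequencies and keep only the frequent items.
    freq = Counter(item for t in transactions for item in set(t))
    freq = {i: c for i, c in freq.items() if c >= min_support}
    root, node_links = FPNode(None, None), defaultdict(list)
    # Second scan: insert each transaction with items in descending-frequency order.
    for t in transactions:
        items = sorted((i for i in set(t) if i in freq),
                       key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                node_links[item].append(child)
            node = node.children[item]
            node.count += 1
    return root, node_links, freq

def fp_growth(transactions, min_support, suffix=()):
    """Mine all frequent itemsets by pattern-fragment growth over conditional databases."""
    _, node_links, freq = build_fptree(transactions, min_support)
    for item in sorted(freq, key=freq.get):               # least frequent first
        pattern = (item,) + suffix
        yield pattern, freq[item]
        # Conditional database: the prefix paths leading to this item.
        cond_db = []
        for node in node_links[item]:
            path, parent = [], node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            cond_db.extend([path] * node.count)
        yield from fp_growth(cond_db, min_support, pattern)

transactions = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c", "d"]]
for itemset, support in fp_growth(transactions, min_support=3):
    print(itemset, support)
```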

2,567 citations


Journal ArticleDOI
TL;DR: An experimental application to network intrusion detection shows that SmartSifter was able to identify data with high scores that corresponded to attacks, with low computational costs.
Abstract: Outlier detection is a fundamental issue in data mining, specifically in fraud detection, network intrusion detection, network monitoring, etc. SmartSifter is an outlier detection engine addressing this problem from the viewpoint of statistical learning theory. This paper provides a theoretical basis for SmartSifter and empirically demonstrates its effectiveness. SmartSifter detects outliers in an on-line process through the on-line unsupervised learning of a probabilistic model (using a finite mixture model) of the information source. Each time a datum is input SmartSifter employs an on-line discounting learning algorithm to learn the probabilistic model. A score is given to the datum based on the learned model with a high score indicating a high possibility of being a statistical outlier. The novel features of SmartSifter are: (1) it is adaptive to non-stationary sources of data; (2) a score has a clear statistical/information-theoretic meaning; (3) it is computationally inexpensive; and (4) it can handle both categorical and continuous variables. An experimental application to network intrusion detection shows that SmartSifter was able to identify data with high scores that corresponded to attacks, with low computational costs. Further experimental application has identified a number of meaningful rare cases in actual health insurance pathology data from Australia's Health Insurance Commission.
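
The on-line discounting and scoring idea can be sketched with a deliberately simplified model. SmartSifter itself learns a finite mixture model and handles categorical variables; the single-Gaussian scorer below, its discount value, and the toy stream are assumptions made only to show how a discounted update yields high scores for statistical outliers.

```python
import math

class DiscountingGaussianScorer:
    """Toy on-line outlier scorer: a single Gaussian updated with a discounting
    (exponential forgetting) factor. SmartSifter uses a richer finite mixture
    model; this only illustrates the scoring idea."""

    def __init__(self, discount=0.05):
        self.r = discount            # forgetting rate: higher = adapts faster
        self.mean, self.var, self.initialized = 0.0, 1.0, False

    def score_and_update(self, x):
        # Score = negative log-likelihood under the *current* model, so data
        # that are unlikely under the learned model receive high scores.
        score = 0.5 * math.log(2 * math.pi * self.var) \
                + (x - self.mean) ** 2 / (2 * self.var)
        if not self.initialized:
            self.mean, self.initialized = x, True
        else:
            # Discounted (on-line) update of mean and variance.
            diff = x - self.mean
            self.mean += self.r * diff
            self.var = (1 - self.r) * self.var + self.r * diff ** 2
        return score

scorer = DiscountingGaussianScorer(discount=0.05)
stream = [1.0, 1.1, 0.9, 1.05, 0.95, 8.0, 1.0, 1.1]   # 8.0 is an injected anomaly
for x in stream:
    print(f"x={x:4.2f}  score={scorer.score_and_update(x):6.2f}")
```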

592 citations


Journal ArticleDOI
TL;DR: A new framework for associations based on the concept of closed frequent itemsets is presented; the number of non-redundant rules produced by the new approach is exponentially smaller than the rule set from the traditional approach.
Abstract: The traditional association rule mining framework produces many redundant rules. The extent of redundancy is a lot larger than previously suspected. We present a new framework for associations based on the concept of closed frequent itemsets. The number of non-redundant rules produced by the new approach is exponentially (in the length of the longest frequent itemset) smaller than the rule set from the traditional approach. Experiments using several “hard” as well as “easy” real and synthetic databases confirm the utility of our framework in terms of reduction in the number of rules presented to the user, and in terms of time.
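
The notion of a closed frequent itemset that this framework rests on can be made concrete with a small brute-force sketch: the closure of an itemset is the intersection of all transactions containing it, and an itemset is closed when it equals its own closure. The transactions and support threshold below are invented, and the naive enumeration stands in for the paper's mining algorithm.

```python
from itertools import combinations

transactions = [frozenset(t) for t in
                [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "c"}, {"b", "c"}]]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

def closure(itemset):
    """Intersection of all transactions containing the itemset: the largest
    superset with exactly the same support."""
    covering = [t for t in transactions if itemset <= t]
    return frozenset.intersection(*covering) if covering else frozenset()

# Enumerate frequent itemsets naively and keep only the closed ones.
items = sorted({i for t in transactions for i in t})
min_support = 2
closed = set()
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        itemset = frozenset(combo)
        if support(itemset) >= min_support and closure(itemset) == itemset:
            closed.add(itemset)

for c in sorted(closed, key=sorted):
    print(sorted(c), support(c))
```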

421 citations


Journal ArticleDOI
TL;DR: An approach to induce high-precision maps from traces of vehicles equipped with differential GPS receivers is presented; the new contributions include a spatial clustering algorithm for inferring the connectivity structure, more powerful lane-finding algorithms that can handle lane splits and merges, and an approach to inferring detailed intersection models.
Abstract: Despite the increasing popularity of route guidance systems, current digital maps are still inadequate for many advanced applications in automotive safety and convenience. Among the drawbacks are the insufficient accuracy of road geometry and the lack of fine-grained information, such as lane positions and intersection structure. In this paper, we present an approach to induce high-precision maps from traces of vehicles equipped with differential GPS receivers. Since the cost of these systems is rapidly decreasing and wireless technology is advancing to provide the communication infrastructure, we expect that in the next few years large amounts of car data will be available inexpensively. Our approach consists of successive processing steps: individual vehicle trajectories are divided into road segments and intersections; a road centerline is derived for each segment; lane positions are determined by clustering the perpendicular offsets from it; and the transitions of traces between segments are utilized in the generation of intersection models. This paper describes an approach to this complex data-mining task in a contiguous manner. Among the new contributions are a spatial clustering algorithm for inferring the connectivity structure, more powerful lane finding algorithms that are able to handle lane splits and merges, and an approach to inferring detailed intersection models.

260 citations


Journal ArticleDOI
TL;DR: A class of methods is introduced that begin by using a single database pass to perform a partial computation of the totals required, storing these in the form of a set enumeration tree, which is created in time linear to the size of the database.
Abstract: A well-known approach to Knowledge Discovery in Databases involves the identification of association rules linking database attributes. Extracting all possible association rules from a database, however, is a computationally intractable problem, because of the combinatorial explosion in the number of sets of attributes for which incidence-counts must be computed. Existing methods for dealing with this may involve multiple passes of the database, and tend still to cope badly with densely-packed database records. We describe here a class of methods we have introduced that begin by using a single database pass to perform a partial computation of the totals required, storing these in the form of a set enumeration tree, which is created in time linear to the size of the database. Algorithms for using this structure to complete the count summations are discussed, and a method is described, derived from the well-known Apriori algorithm. Results are presented demonstrating the performance advantage to be gained from the use of this approach. Finally, we discuss possible further applications of the method.
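
A minimal sketch of the single-pass idea described above: each (ordered) transaction is stored once in a set-enumeration tree with a partial count, and total supports are completed afterwards by summing over the stored supersets. The simple tree walk used for the summation and the toy database are assumptions; they stand in for the paper's more efficient summation algorithms.

```python
class SETreeNode:
    def __init__(self):
        self.count = 0          # partial support: transactions stored at this node
        self.children = {}

def build_partial_tree(transactions):
    """Single database pass: each (lexicographically ordered) transaction is
    stored once in a set-enumeration tree, incrementing a partial count."""
    root = SETreeNode()
    for t in transactions:
        node = root
        for item in sorted(set(t)):
            node = node.children.setdefault(item, SETreeNode())
        node.count += 1
    return root

def total_support(root, itemset):
    """Complete the summation: the support of `itemset` is the sum of partial
    counts over all stored sets that contain it (a plain tree walk here)."""
    itemset = frozenset(itemset)
    total = 0
    stack = [(root, frozenset())]
    while stack:
        node, path = stack.pop()
        if itemset <= path:
            total += node.count
        for item, child in node.children.items():
            stack.append((child, path | {item}))
    return total

db = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["a", "b"], ["b", "c"]]
tree = build_partial_tree(db)
print(total_support(tree, {"a", "b"}))   # -> 3
```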

134 citations


Journal ArticleDOI
TL;DR: A notion of convertible constraints is developed; this class is systematically analyzed, classified, and characterized, and techniques are developed that enable such constraints to be pushed deep inside the recently developed FP-growth algorithm for frequent itemset mining.
Abstract: Recent work has highlighted the importance of the constraint-based mining paradigm in the context of frequent itemsets, associations, correlations, sequential patterns, and many other interesting patterns in large databases. Constraint pushing techniques have been developed for mining frequent patterns and associations with antimonotonic, monotonic, and succinct constraints. In this paper, we study constraints which cannot be handled with existing theory and techniques in frequent pattern mining. For example, avg(S) θ v, median(S) θ v, sum(S) θ v (S can contain items of arbitrary values, θ ∈ {>, <, ≤, ≥} and v is a real number) are customarily regarded as “tough” constraints in that they cannot be pushed inside an algorithm such as Apriori. We develop a notion of convertible constraints and systematically analyze, classify, and characterize this class. We also develop techniques which enable them to be readily pushed deep inside the recently developed FP-growth algorithm for frequent itemset mining. Results from our detailed experiments show the effectiveness of the techniques developed.
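
The following toy sketch shows why a constraint such as avg(S) ≥ v is "convertible": enumerated over a value-descending item order, a prefix that violates the constraint cannot be repaired by any extension, so the whole subtree can be pruned. The item values and threshold are invented, and the plain set-enumeration search stands in for the FP-growth integration described in the paper.

```python
# Toy illustration of a convertible constraint: avg(S) >= v.
# Under an arbitrary enumeration order the constraint is neither antimonotonic
# nor monotonic, but if items are enumerated in value-descending order the
# average of a prefix can only stay equal or decrease as it grows, so any
# prefix violating avg(prefix) >= v can be pruned with all of its extensions.

item_value = {"a": 90, "b": 60, "c": 40, "d": 10}   # made-up item profits
v = 50                                               # constraint: avg(S) >= 50

def avg(itemset):
    return sum(item_value[i] for i in itemset) / len(itemset)

def enumerate_with_pruning(prefix, remaining, results):
    for idx, item in enumerate(remaining):
        candidate = prefix + [item]
        if avg(candidate) < v:
            # Convertible antimonotone: every extension of this prefix (which
            # can only append items of equal or smaller value) also fails.
            continue
        results.append(candidate)
        enumerate_with_pruning(candidate, remaining[idx + 1:], results)
    return results

# Sorting items in value-descending order is what makes the constraint prunable.
order = sorted(item_value, key=item_value.get, reverse=True)
print(enumerate_with_pruning([], order, []))
```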

130 citations


Journal ArticleDOI
TL;DR: An extension of the learning rules in a Principal Component Analysis network, derived to be optimal for a specific probability density function, is reviewed; rather than viewing a single member of this family of learning rules as an extension of PCA, the whole family is viewed as methods of performing Exploratory Projection Pursuit.
Abstract: In this paper, we review an extension of the learning rules in a Principal Component Analysis network which has been derived to be optimal for a specific probability density function. We note that this probability density function is one of a family of pdfs and investigate the learning rules formed in order to be optimal for several members of this family. We show that, whereas we have previously (Lai et al., 2000; Fyfe and MacDonald, 2002) viewed the single member of the family as an extension of PCA, it is more appropriate to view the whole family of learning rules as methods of performing Exploratory Projection Pursuit. We illustrate this on both artificial and real data sets.

106 citations


Journal ArticleDOI
TL;DR: This study integrates the discovery of frequent itemsets with a (microeconomic) model for product selection (PROFSET) that enables the integration of both quantitative and qualitative criteria and demonstrates that the impact of product assortment decisions on overall assortment profitability can easily be evaluated by means of sensitivity analysis.
Abstract: It has been claimed that the discovery of association rules is well suited for applications of market basket analysis to reveal regularities in the purchase behaviour of customers. However, one disadvantage of association discovery today is that there is no provision for taking into account the business value of an association. Therefore, recent work indicates that the discovery of interesting rules can in fact best be addressed within a microeconomic framework. This study integrates the discovery of frequent itemsets with a (microeconomic) model for product selection (PROFSET). The model enables the integration of both quantitative and qualitative (domain knowledge) criteria. Sales transaction data from a fully automated convenience store are used to demonstrate the effectiveness of the model against a heuristic for product selection based on product-specific profitability. We show that with the use of frequent itemsets we are able to identify the cross-sales potential of product items and use this information for better product selection. Furthermore, we demonstrate that the impact of product assortment decisions on overall assortment profitability can easily be evaluated by means of sensitivity analysis.
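
As a rough sketch of how frequent itemsets can drive product selection, the greedy routine below repeatedly adds the frequent itemset with the best margin per newly stocked product until an assortment size limit is reached. The paper formulates PROFSET as a microeconomic optimization model rather than this greedy stand-in, and the itemsets, margins, and size limit here are invented.

```python
def greedy_profset(frequent_sets, margins, max_products):
    """Greedy stand-in for the PROFSET idea: repeatedly add the frequent itemset
    with the best margin per newly stocked product, so that cross-sales potential
    (itemsets rather than individual products) drives the assortment decision."""
    assortment, profit = set(), 0.0
    remaining = set(frequent_sets)
    while remaining:
        def value(s):
            new = len(set(s) - assortment)
            return margins[s] / new if new else float("inf")
        best = max(remaining, key=value)
        remaining.discard(best)
        if len(assortment | set(best)) > max_products:
            continue                        # would exceed the shelf-space limit
        assortment |= set(best)
        profit += margins[best]             # margin counts only if all items are stocked
    return assortment, profit

# Invented frequent itemsets with made-up margins attributable to them.
frequent_sets = [("beer", "chips"), ("beer", "nuts"), ("soda", "chips"), ("milk",)]
margins = {("beer", "chips"): 12.0, ("beer", "nuts"): 5.0,
           ("soda", "chips"): 7.0, ("milk",): 3.0}
print(greedy_profset(frequent_sets, margins, max_products=3))
```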

84 citations


Journal ArticleDOI
TL;DR: This work proposes an algorithm that remains very efficient, generally applicable, and multidimensional, but is more robust to noise and outliers because it uses medians rather than means as estimators for the centers of clusters.
Abstract: General purpose and highly applicable clustering methods are usually required during the early stages of knowledge discovery exercises. k-MEANS has been adopted as the prototype of iterative model-based clustering because of its speed, simplicity and capability to work within the format of very large databases. However, k-MEANS has several disadvantages derived from its statistical simplicity. We propose an algorithm that remains very efficient, generally applicable, and multidimensional, but is more robust to noise and outliers. We achieve this by using medians rather than means as estimators for the centers of clusters. Comparison with k-MEANS, EXPECTATION MAXIMIZATION, and sampling demonstrates the advantages of our algorithm.
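
The medians-rather-than-means idea can be sketched in a few lines: a Lloyd-style iteration in which each cluster center is the coordinate-wise median of its members, paired with the L1 distance. Initialization, convergence checks, and the data are simplified or invented; this is a minimal sketch, not the authors' exact algorithm.

```python
import random
from statistics import median

def k_medians(points, k, iters=20, seed=0):
    """Lloyd-style iteration, but cluster centers are coordinate-wise medians,
    which makes the centers far less sensitive to outliers than means."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Manhattan (L1) distance pairs naturally with the median.
            nearest = min(range(k),
                          key=lambda j: sum(abs(a - b) for a, b in zip(p, centers[j])))
            clusters[nearest].append(p)
        centers = [tuple(median(c) for c in zip(*cluster)) if cluster else centers[j]
                   for j, cluster in enumerate(clusters)]
    return centers, clusters

# Two compact groups plus one gross outlier that would drag a mean-based center.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9), (100, 100)]
centers, clusters = k_medians(points, k=2)
print(centers)
```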

81 citations


Journal ArticleDOI
TL;DR: The key factors that influence the performance of the pattern growth approach are identified, and the combination of the top-down traversal strategy and the ascending frequency order achieves significant performance improvement over previous works.
Abstract: Mining frequent patterns, including mining frequent closed patterns or maximal patterns, is a fundamental and important problem in the data mining area. Many algorithms adopt the pattern growth approach, which is shown to be superior to the candidate generate-and-test approach, especially when long patterns exist in the datasets. In this paper, we identify the key factors that influence the performance of the pattern growth approach, and optimize them to further improve the performance. Our algorithm uses a simple yet compact data structure, the ascending frequency ordered prefix-tree (AFOPT), to store the conditional databases, in which we use arrays to store single branches to further save space. The AFOPT structure is traversed in top-down depth-first order. Our analysis and experimental results show that the combination of the top-down traversal strategy and the ascending frequency order achieves significant performance improvement over previous works.

72 citations


Journal ArticleDOI
TL;DR: A comparative study on different kinds of sequential association rules for web document prediction shows that the existing approaches can be cast under two important dimensions, namely the type of antecedents of rules and the criterion for selecting prediction rules.
Abstract: Web servers keep track of web users' browsing behavior in web logs. From these logs, one can build statistical models that predict the users' next requests based on their current behavior. These data are complex due to their large size and sequential nature. In the past, researchers have proposed different methods for building association-rule based prediction models using the web logs, but there has been no systematic study on the relative merits of these methods. In this paper, we provide a comparative study on different kinds of sequential association rules for web document prediction. We show that the existing approaches can be cast under two important dimensions, namely the type of antecedents of rules and the criterion for selecting prediction rules. From this comparison we propose a best overall method and empirically test the proposed model on real web logs.
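
One point in the design space the paper compares can be sketched directly: rules whose antecedent is the last k pages of a session, with the prediction chosen by confidence. The sessions, the value of k, and the count threshold below are invented, and the sketch does not reproduce the paper's full comparison of antecedent types and selection criteria.

```python
from collections import defaultdict

def mine_last_k_rules(sessions, k=2, min_count=2):
    """Count how often each length-k suffix of the browsing history is followed
    by a given next page, then turn the counts into confidence-ranked rules."""
    rule_counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        for i in range(k, len(session)):
            antecedent = tuple(session[i - k:i])
            rule_counts[antecedent][session[i]] += 1
    rules = {}
    for antecedent, nexts in rule_counts.items():
        total = sum(nexts.values())
        page, count = max(nexts.items(), key=lambda kv: kv[1])
        if count >= min_count:
            rules[antecedent] = (page, count / total)   # (prediction, confidence)
    return rules

def predict(rules, recent_pages, k=2):
    return rules.get(tuple(recent_pages[-k:]))

sessions = [
    ["home", "news", "sports", "scores"],
    ["home", "news", "sports", "scores"],
    ["home", "news", "weather"],
    ["search", "news", "sports", "scores"],
]
rules = mine_last_k_rules(sessions, k=2)
print(predict(rules, ["home", "news", "sports"], k=2))   # -> ('scores', 1.0)
```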

Journal ArticleDOI
TL;DR: This paper presents a set of new algorithms that solve the Distributed Association Rule Mining problem using far less communication; the new algorithms are also extremely robust.
Abstract: Mining for associations between items in large transactional databases is a central problem in the field of knowledge discovery. When the database is partitioned among several share-nothing machines, the problem can be addressed using distributed data mining algorithms. One such algorithm, called CD, was proposed by Agrawal and Shafer and was later enhanced by the FDM algorithm of Cheung, Han et al. The main problem with these algorithms is that they do not scale well with the number of partitions. They are thus impractical for use in modern distributed environments such as peer-to-peer systems, in which hundreds or thousands of computers may interact. In this paper we present a set of new algorithms that solve the Distributed Association Rule Mining problem using far less communication. In addition to being very efficient, the new algorithms are also extremely robust. Unlike existing algorithms, they continue to be efficient even when the data is skewed or the partition sizes are imbalanced. We present both experimental and theoretical results concerning the behavior of these algorithms and explain how they can be implemented in different settings.
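
For orientation, the baseline count-distribution style of distributed mining mentioned in the abstract (each share-nothing node counts candidates over its own partition and the global support is the sum of the exchanged counts) can be sketched as follows. This illustrates only the baseline the paper improves upon, not its new communication-efficient algorithms; the partitions and threshold are invented.

```python
from collections import Counter
from itertools import combinations

def local_counts(partition, k):
    """Each share-nothing node counts candidate k-itemsets over its own partition."""
    counts = Counter()
    for transaction in partition:
        for itemset in combinations(sorted(set(transaction)), k):
            counts[itemset] += 1
    return counts

def count_distribution_round(partitions, k, min_support):
    """Baseline CD-style round: every node shares its local counts and the global
    support is the sum; the paper's algorithms aim to cut this communication."""
    global_counts = Counter()
    for partition in partitions:           # stands in for messages between nodes
        global_counts.update(local_counts(partition, k))
    return {itemset: c for itemset, c in global_counts.items() if c >= min_support}

partitions = [
    [["a", "b", "c"], ["a", "b"]],         # node 1's share of the database
    [["a", "c"], ["a", "b", "c"]],         # node 2's share
    [["b", "c"], ["a", "b"]],              # node 3's share
]
print(count_distribution_round(partitions, k=2, min_support=3))
```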

Journal ArticleDOI
TL;DR: An empirical comparison between the methods used in e-VZpro and other collaborative filtering methods including dependency networks, item-based, and association mining is provided in this paper.
Abstract: Commercial recommender systems use various data mining techniques to make appropriate recommendations to users during online, real-time sessions. Published algorithms focus more on the discrete user ratings instead of binary results, which hampers their predictive capabilities when usage data is sparse. The system proposed in this paper, e-VZpro, is an association mining-based recommender tool designed to overcome these problems through a two-phase approach. In the first phase, batches of customer historical data are analyzed through association mining in order to determine the association rules for the second phase. During the second phase, a scoring algorithm is used to rank the recommendations online for the customer. The second phase differs from the traditional approach and an empirical comparison between the methods used in e-VZpro and other collaborative filtering methods including dependency networks, item-based, and association mining is provided in this paper. This comparison evaluates the algorithms used in each of the above methods using two internal customer datasets and a benchmark dataset. The results of this comparison clearly show that e-VZpro performs well compared to dependency networks and association mining. In general, item-based algorithms with cosine similarity measures have the best performance.

Journal ArticleDOI
TL;DR: A technique is presented that dynamically (i.e., during the search) prunes partitions of prefixes of the sorted data from the search space of the algorithm; it works for all convex and cumulative evaluation functions.
Abstract: We consider multisplitting of numerical value ranges, a task that is encountered as a discretization step preceding induction and also embedded into learning algorithms. We are interested in finding the partition that optimizes the value of a given attribute evaluation function. For most commonly used evaluation functions this task takes quadratic time in the number of potential cut points in the numerical range. Hence, it is a potential bottleneck in data mining algorithms. We present two techniques that speed up the optimal multisplitting task. The first one aims at discarding cut point candidates in a quick linear-time preprocessing scan before embarking on the actual search. We generalize the definition of boundary points by Fayyad and Irani to allow us to merge adjacent example blocks that have the same relative class distribution. We prove for several commonly used evaluation functions that this processing removes only suboptimal cut points. Hence, the algorithm does not lose optimality. Our second technique tackles the quadratic-time dynamic programming algorithm, which is the best schema for optimizing many well-known evaluation functions. We present a technique that dynamically—i.e., during the search—prunes partitions of prefixes of the sorted data from the search space of the algorithm. The method works for all convex and cumulative evaluation functions. Together the use of these two techniques speeds up the multisplitting process considerably. Compared to the baseline dynamic programming algorithm the speed-up is around 50 percent on the average and up to 90 percent in some cases. We conclude that optimal multisplitting is fully feasible on all benchmark data sets we have encountered.
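
The quadratic-time dynamic programming schema that the paper's pruning techniques accelerate can be sketched as follows, using weighted class entropy as an example of a cumulative evaluation function. The pruning of boundary points and of DP prefixes is deliberately omitted, and the data and number of intervals are invented.

```python
import math
from collections import Counter

def interval_impurity(labels):
    """Class-entropy impurity of one interval, weighted by its size
    (one example of a cumulative evaluation function)."""
    n = len(labels)
    counts = Counter(labels)
    return -n * sum((c / n) * math.log2(c / n) for c in counts.values())

def optimal_multisplit(values, labels, k):
    """DP over candidate cut points: best[i][j] is the lowest impurity of splitting
    the first i (sorted) examples into j intervals. With prefix class counts the
    search is quadratic in the number of cut points, as in the paper; recomputing
    interval impurities here keeps the sketch short."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    vals = [values[i] for i in order]
    labs = [labels[i] for i in order]
    n, INF = len(vals), float("inf")
    best = [[INF] * (k + 1) for _ in range(n + 1)]
    cut = [[None] * (k + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for j in range(1, k + 1):
        for i in range(1, n + 1):
            for s in range(j - 1, i):            # last interval covers examples s..i-1
                cand = best[s][j - 1] + interval_impurity(labs[s:i])
                if cand < best[i][j]:
                    best[i][j], cut[i][j] = cand, s
    # Recover the cut positions as thresholds between consecutive values.
    bounds, i, j = [], n, k
    while j > 1:
        s = cut[i][j]
        bounds.append((vals[s - 1] + vals[s]) / 2)
        i, j = s, j - 1
    return best[n][k], sorted(bounds)

values = [1, 2, 3, 10, 11, 12, 20, 21, 22]
labels = ["x", "x", "x", "y", "y", "y", "x", "x", "x"]
print(optimal_multisplit(values, labels, k=3))   # impurity 0.0, cuts near 6.5 and 16
```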

Journal ArticleDOI
TL;DR: It is formally proved that the proposed algorithm for subsequence matching that supports normalization transform in time-series databases does not cause false dismissal and is suitable for practical situations, where the queries with smaller selectivities are much more frequent.
Abstract: In this paper, an algorithm is proposed for subsequence matching that supports normalization transform in time-series databases. Normalization transform enables finding sequences with similar fluctuation patterns even though they are not close to each other before the normalization transform. Simple application of existing subsequence matching algorithms to support normalization transform is not feasible since the algorithms do not have information for normalization transform of subsequences of arbitrary lengths. Application of the existing whole matching algorithm supporting normalization transform to the subsequence matching is feasible, but requires an index for every possible length of the query sequence, causing serious overhead on both storage space and update time. The proposed algorithm generates indexes only for a small number of different lengths of query sequences. For subsequence matching it selects the most appropriate index among them. Better search performance can be obtained by using more indexes. In this paper, the approach is called index interpolation. It is formally proved that the proposed algorithm does not cause false dismissal. The search performance can be traded off with storage space by adjusting the number of indexes. For performance evaluation, a series of experiments is conducted using the indexes for only five different lengths out of lengths 256 to 512 of the query sequence. The results show that the proposed algorithm outperforms the sequential scan by up to 24 times on the average when the selectivity of the query is 10^-2 and up to 146 times when it is 10^-5. Since the proposed algorithm performs better with smaller selectivities, it is suitable for practical situations, where the queries with smaller selectivities are much more frequent.
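
What "subsequence matching under normalization transform" asks for can be illustrated with a brute-force scan in which each query-length window and the query are z-normalized before the distance test. The paper replaces exactly this kind of scan with index interpolation; the sequences and tolerance below are invented.

```python
import math

def z_normalize(seq):
    """Normalization transform: zero mean, unit standard deviation, so that
    sequences with similar fluctuation *shapes* become comparable."""
    n = len(seq)
    mean = sum(seq) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in seq) / n) or 1.0
    return [(x - mean) / std for x in seq]

def normalized_subsequence_matches(data, query, epsilon):
    """Brute-force scan: slide a query-length window over the data sequence and
    report windows whose z-normalized Euclidean distance to the z-normalized
    query is within epsilon."""
    m = len(query)
    q = z_normalize(query)
    hits = []
    for start in range(len(data) - m + 1):
        w = z_normalize(data[start:start + m])
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(w, q)))
        if dist <= epsilon:
            hits.append((start, dist))
    return hits

data = [10, 11, 13, 12, 50, 52, 56, 54, 20, 21, 23, 22]
query = [1, 2, 4, 3]          # same shape as [10,11,13,12] and [50,52,56,54]
print(normalized_subsequence_matches(data, query, epsilon=0.5))
```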

Journal ArticleDOI
TL;DR: This work proposes a novel hypergraph model to represent the relations among the patterns, and introduces a vertex-to-cluster affinity concept to enable the use of existing metrics in the second phase.
Abstract: In traditional approaches for clustering market basket type data, relations among transactions are modeled according to the items occurring in these transactions. However, an individual item might induce different relations in different contexts. Since such contexts might be captured by interesting patterns in the overall data, we represent each transaction as a set of patterns through modifying the conventional pattern semantics. By clustering the patterns in the dataset, we infer a clustering of the transactions represented this way. For this, we propose a novel hypergraph model to represent the relations among the patterns. Instead of a local measure that depends only on common items among patterns, we propose a global measure that is based on the co-occurrences of these patterns in the overall data. The success of existing hypergraph partitioning based algorithms in other domains depends on sparsity of the hypergraph and explicit objective metrics. For this, we propose a two-phase clustering approach for the above hypergraph, which is expected to be dense. In the first phase, the vertices of the hypergraph are merged in a multilevel algorithm to obtain a large number of high-quality clusters. Here, we propose new quality metrics for merging decisions in hypergraph clustering specifically for this domain. In order to enable the use of existing metrics in the second phase, we introduce a vertex-to-cluster affinity concept to devise a method for constructing a sparse hypergraph based on the obtained clustering. The experiments we have performed show the effectiveness of the proposed framework.

Journal ArticleDOI
TL;DR: The concept of information gain is proposed to measure the overall degree of surprise of the pattern within a data sequence and the bounded information gain property is identified to tackle the predicament caused by the violation of the downward closure property by the information gain measure.
Abstract: In this paper, we focus on mining surprising periodic patterns in a sequence of events. In many applications, e.g., computational biology, an infrequent pattern is still considered very significant if its actual occurrence frequency exceeds the prior expectation by a large margin. The traditional metric, such as support, is not necessarily the ideal model to measure this kind of surprising patterns because it treats all patterns equally in the sense that every occurrence carries the same weight towards the assessment of the significance of a pattern regardless of the probability of occurrence. A more suitable measurement, information, is introduced to naturally value the degree of surprise of each occurrence of a pattern as a continuous and monotonically decreasing function of its probability of occurrence. This would allow patterns with vastly different occurrence probabilities to be handled seamlessly. As the accumulated degree of surprise of all repetitions of a pattern, the concept of information gain is proposed to measure the overall degree of surprise of the pattern within a data sequence. The bounded information gain property is identified to tackle the predicament caused by the violation of the downward closure property by the information gain measure and in turn provides an efficient solution to this problem. Furthermore, the user has a choice between specifying a minimum information gain threshold and choosing the number of surprising patterns wanted. Empirical tests demonstrate the efficiency and the usefulness of the proposed model.
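
The information-gain measure can be illustrated with a small sketch: each occurrence of a pattern contributes an information value of -log2 of its occurrence probability, and the information gain accumulates that value over the pattern's repetitions in the sequence. The independence-based probability model, the period-aligned occurrence counting, and the event sequence are simplifying assumptions for the example.

```python
import math
from collections import Counter

def pattern_probability(pattern, symbol_prob):
    """Probability of one occurrence of the pattern under an independence model;
    '*' is a don't-care position that matches any event."""
    p = 1.0
    for symbol in pattern:
        if symbol != "*":
            p *= symbol_prob[symbol]
    return p

def information_gain(sequence, pattern):
    """Information of one occurrence is -log2(probability); the information gain
    of the pattern is that value accumulated over its period-aligned occurrences."""
    counts = Counter(sequence)
    symbol_prob = {s: c / len(sequence) for s, c in counts.items()}
    period = len(pattern)
    occurrences = 0
    for start in range(0, len(sequence) - period + 1, period):
        window = sequence[start:start + period]
        if all(p == "*" or p == w for p, w in zip(pattern, window)):
            occurrences += 1
    info_per_occurrence = -math.log2(pattern_probability(pattern, symbol_prob))
    return occurrences, occurrences * info_per_occurrence

# The pattern (a, f, *, *) recurs in every period, so each occurrence's surprise
# (-log2 of its probability) accumulates into a large information gain.
sequence = list("afbcafbdafbcafbe")
print(information_gain(sequence, ["a", "f", "*", "*"]))
```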

Journal ArticleDOI
TL;DR: A new method of outlier detection and data cleaning for both normal and non-normal multivariate data sets is proposed; it is based on an iterated local fit without a priori metric assumptions and provides good results with large data sets.
Abstract: A new method of outlier detection and data cleaning for both normal and non-normal multivariate data sets is proposed. It is based on an iterated local fit without a priori metric assumptions. We propose a new approach supported by finite mixture clustering which provides good results with large data sets. A multi-step structure, consisting of three phases, is developed. The importance of outlier detection in industrial modeling for open-loop control prediction is also described. The described algorithm gives good results both in simulation runs with artificial data sets and with experimental data sets recorded in a rubber factory. Finally, some discussion of this methodology is presented.

Journal ArticleDOI
TL;DR: This paper studies how to incrementally modify and maintain the concise boundary descriptions of the space of all emerging patterns when small changes occur to the data, and introduces algorithms to handle four types of changes.
Abstract: Emerging patterns (EPs) are useful knowledge patterns with many applications. In recent studies on bio-medical profiling data, we have successfully used such patterns to solve difficult cancer diagnosis problems and produced higher classification accuracy when compared to alternative methods. However, the discovery of EPs is a challenging and computationally expensive problem. In this paper, we study how to incrementally modify and maintain the concise boundary descriptions of the space of all emerging patterns when small changes occur to the data. As EP spaces are convex, the maintenance on the bounds guarantees that no desired patterns are lost. We introduce algorithms to handle four types of changes: insertion of new data, deletion of old data, addition of new attributes, and deletion of old attributes. We compare these incremental algorithms, on six benchmark data sets, against an efficient algorithm that computes from scratch. The results show that the incremental algorithms are much faster than the From-Scratch method, often with tremendous speed-up rates.

Journal ArticleDOI
TL;DR: Two selection techniques are proposed for the selection of inputs for classification models based on ratios of measured quantities: one based on a pre-selection procedure and another based on a genetic algorithm.
Abstract: This paper is concerned with the selection of inputs for classification models based on ratios of measured quantities. For this purpose, all possible ratios are built from the quantities involved and variable selection techniques are used to choose a convenient subset of ratios. In this context, two selection techniques are proposed: one based on a pre-selection procedure and another based on a genetic algorithm. In an example involving the financial distress prediction of companies, the models obtained from ratios selected by the proposed techniques compare favorably to a model using ratios usually found in the financial distress literature.
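
The first steps of the approach can be sketched directly: build every pairwise ratio of the measured quantities and pre-select ratios with a simple univariate criterion. The correlation-based criterion, the quantities, and the labels below are invented stand-ins; the paper's genetic-algorithm selection and financial data are not reproduced.

```python
from itertools import permutations
from statistics import mean, pstdev

def build_ratios(rows, quantities):
    """Form every ordered pairwise ratio of the measured quantities."""
    ratio_rows = []
    for row in rows:
        features = {}
        for a, b in permutations(quantities, 2):
            if row[b] != 0:
                features[f"{a}/{b}"] = row[a] / row[b]
        ratio_rows.append(features)
    return ratio_rows

def preselect(ratio_rows, labels, top_k=2):
    """Toy pre-selection: rank ratios by absolute correlation with the label."""
    names = ratio_rows[0].keys()
    def abs_corr(name):
        xs = [r[name] for r in ratio_rows]
        mx, my = mean(xs), mean(labels)
        sx, sy = pstdev(xs), pstdev(labels)
        if sx == 0 or sy == 0:
            return 0.0
        cov = mean((x - mx) * (y - my) for x, y in zip(xs, labels))
        return abs(cov / (sx * sy))
    return sorted(names, key=abs_corr, reverse=True)[:top_k]

rows = [{"debt": 5, "assets": 10, "income": 2},
        {"debt": 9, "assets": 10, "income": 1},
        {"debt": 2, "assets": 10, "income": 3},
        {"debt": 8, "assets": 10, "income": 1}]
labels = [0, 1, 0, 1]          # 1 = distressed (made-up)
print(preselect(build_ratios(rows, ["debt", "assets", "income"]), labels))
```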

Journal ArticleDOI
Charu C. Aggarwal
TL;DR: This paper mines the significant browsing patterns of world wide web users in order to model the likelihood of web pages belonging to a specified predicate, and refers to this technique as collaborative crawling because it mines the collective user experiences in order to find topical resources.
Abstract: In recent years, there has been considerable research on constructing crawlers which find resources satisfying specific conditions called predicates. Such a predicate could be a keyword query, a topical query, or some arbitrary constraint on the internal structure of the web page. Several techniques such as focused crawling and intelligent crawling have recently been proposed for performing the topic specific resource discovery process. All these crawlers are linkage based, since they use the hyperlink behavior in order to perform resource discovery. Recent studies have shown that the topical correlations in hyperlinks are quite noisy and may not always show the consistency necessary for a reliable resource discovery process. In this paper, we will approach the problem of resource discovery from an entirely different perspective; we will mine the significant browsing patterns of world wide web users in order to model the likelihood of web pages belonging to a specified predicate. This user behavior can be mined from the freely available traces of large public domain proxies on the world wide web. For example, proxy caches such as Squid are hierarchical proxies which make their logs publicly available. As we shall see in this paper, such traces are a rich source of information which can be mined in order to find the users that are most relevant to the topic of a given crawl. We refer to this technique as collaborative crawling because it mines the collective user experiences in order to find topical resources. Such a strategy turns out to be extremely effective because the topical consistency in world wide web browsing patterns turns out to be very high compared to the noisy linkage information. In addition, the user-centered crawling system can be combined with linkage based systems to create an overall system which works more effectively than a system based purely on either user behavior or hyperlinks.

Journal ArticleDOI
TL;DR: A new, abstract method derived from convex set theory is presented that can be used to determine the properties of an unknown, complex data set and to assist in finding the most appropriate recognition algorithm.
Abstract: In this paper a new, abstract method for analysis and visualization of multidimensional data sets in pattern recognition problems is introduced. It can be used to determine the properties of an unknown, complex data set and to assist in finding the most appropriate recognition algorithm. Additionally, it can be employed to design layers of a feedforward artificial neural network or to visualize the higher-dimensional problems in 2-D and 3-D without losing relevant data set information. The method is derived from the convex set theory and works by considering convex subsets within the data and analyzing their respective positions in the original dimension. Its ability to describe certain set features that cannot be explicitly projected into lower dimensions sets it apart from many other visualization techniques. Two classical multidimensional problems are analyzed and the results show the usefulness of the presented method and underline its strengths and weaknesses.