
Showing papers on "Apriori algorithm" published in 2005


Journal ArticleDOI
TL;DR: The R package arules presented in this paper provides a basic infrastructure for creating and manipulating input data sets and for analyzing the resulting itemsets and rules.
Abstract: Mining frequent itemsets and association rules is a popular and well researched approach for discovering interesting relationships between variables in large databases. The R package arules presented in this paper provides a basic infrastructure for creating and manipulating input data sets and for analyzing the resulting itemsets and rules. The package also includes interfaces to two fast mining algorithms, the popular C implementations of Apriori and Eclat by Christian Borgelt. These algorithms can be used to mine frequent itemsets, maximal frequent itemsets, closed frequent itemsets and association rules.
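
To make the level-wise search concrete, here is a minimal Python sketch of Apriori-style frequent itemset mining. All names are ours for illustration; the package itself calls Borgelt's C implementations from R.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent itemset mining (illustrative sketch)."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    current = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    k = 1
    while current:
        # count candidate supports with one pass over the transactions
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # join step: combine frequent k-itemsets into (k+1)-candidates
        keys = list(level)
        current = {a | b for a in keys for b in keys if len(a | b) == k + 1}
        # prune step (the apriori property): every k-subset must be frequent
        current = {c for c in current
                   if all(frozenset(s) in level for s in combinations(c, k))}
        k += 1
    return frequent

baskets = [{"milk", "bread"}, {"milk", "beer"}, {"milk", "bread", "beer"}, {"bread"}]
print(apriori(baskets, min_support=0.5))
```

Eclat differs mainly in traversing the search space depth-first over transaction-id lists rather than level-wise over the database.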

470 citations


Proceedings ArticleDOI
21 Aug 2005
TL;DR: This paper describes a C implementation of the FP-growth algorithm, which contains two variants of the core operation of computing a projection of an FP-tree (the fundamental data structure of the FP-growth algorithm), and reports experimental results comparing this implementation with three other frequent item set mining algorithms I implemented.
Abstract: The FP-growth algorithm is currently one of the fastest approaches to frequent item set mining. In this paper I describe a C implementation of this algorithm, which contains two variants of the core operation of computing a projection of an FP-tree (the fundamental data structure of the FP-growth algorithm). In addition, projected FP-trees are (optionally) pruned by removing items that have become infrequent due to the projection (an approach that has been called FP-Bonsai). I report experimental results comparing this implementation of the FP-growth algorithm with three other frequent item set mining algorithms I implemented (Apriori, Eclat, and Relim).
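
Purely to illustrate the data structure the paper is about, the sketch below (our simplified Python, not the paper's C code) builds an FP-tree and extracts the conditional pattern base from which a projection is built.

```python
from collections import Counter

class FPNode:
    """FP-tree node: item, count, parent link, children keyed by item."""
    def __init__(self, item, parent):
        self.item, self.count, self.parent, self.children = item, 0, parent, {}

def build_fptree(transactions, min_count):
    # order items by descending global frequency and drop infrequent ones
    freq = Counter(i for t in transactions for i in t)
    order = {i: r for r, (i, c) in enumerate(freq.most_common()) if c >= min_count}
    root, links = FPNode(None, None), {}
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in order), key=order.get):
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = FPNode(item, node)
                links.setdefault(item, []).append(child)
            child.count += 1
            node = child
    return root, links

def conditional_pattern_base(links, item):
    """Prefix paths leading to `item`, each weighted by the item node's count;
    a projection builds the conditional FP-tree from exactly these paths."""
    base = []
    for node in links.get(item, []):
        path, p = [], node.parent
        while p is not None and p.item is not None:
            path.append(p.item)
            p = p.parent
        if path:
            base.append((list(reversed(path)), node.count))
    return base

tree, links = build_fptree([{"a", "b"}, {"b", "c"}, {"a", "b", "c"}], min_count=2)
print(conditional_pattern_base(links, "c"))  # [(['b'], 1), (['b', 'a'], 1)]
```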

295 citations


Journal ArticleDOI
01 Jan 2005
TL;DR: Using a semantics based on the closure of the Galois connection, a condensed representation for association rules is defined; it contains the non-redundant association rules having minimal antecedents and maximal consequents, called min-max association rules.
Abstract: Association rule extraction from operational datasets often produces several tens of thousands, and even millions, of association rules. Moreover, many of these rules are redundant and thus useless. Using a semantics based on the closure of the Galois connection, we define a condensed representation for association rules. This representation is characterized by frequent closed itemsets and their generators. It contains the non-redundant association rules having minimal antecedents and maximal consequents, called min-max association rules. We think that these rules are the most relevant since they are the most general non-redundant association rules. Furthermore, this representation is a basis, i.e., a generating set for all association rules, their supports and their confidences, and all of them can be retrieved without accessing the data. We introduce algorithms for extracting this basis and for reconstructing all association rules. Results of experiments carried out on real datasets show the usefulness of this approach. In order to generate this basis when an algorithm for extracting frequent itemsets--such as Apriori for instance--is used, we also present an algorithm for deriving frequent closed itemsets and their generators from frequent itemsets without using the dataset.
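
The closure operator of the Galois connection has a direct set-theoretic reading: the closure of an itemset is the intersection of all transactions containing it, and a generator is a minimal itemset having a given closed itemset as its closure. A small sketch (names ours):

```python
def closure(itemset, transactions):
    """Galois closure: intersection of all transactions that contain the itemset."""
    covering = [t for t in transactions if itemset <= t]
    return frozenset.intersection(*covering) if covering else itemset

db = [frozenset(t) for t in ({"a", "b", "c"}, {"a", "b"}, {"b", "c"})]
# Every transaction containing 'a' also contains 'b', so closure({'a'}) = {'a', 'b'}
# and {'a'} is a generator of the closed itemset {'a', 'b'}; the min-max rule it
# yields is a -> b.
print(closure(frozenset({"a"}), db))
```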

206 citations


Proceedings Article
01 Jan 2005
TL;DR: WFIM generates more concise and important weighted frequent itemsets in large databases, particularly dense databases with low minimum support, by adjusting a minimum weight and a weight range.
Abstract: Researchers have proposed weighted frequent itemset mining algorithms that reflect the importance of items. The main focus of weighted frequent itemset mining concerns satisfying the downward closure property. All weighted association rule mining algorithms suggested so far have been based on the Apriori algorithm. However, pattern growth algorithms are more efficient than Apriori based algorithms. Our main approach is to push the weight constraints into the pattern growth algorithm while maintaining the downward closure property. In this paper, a weight range and a minimum weight constraint are defined and items are given different weights within the weight range. The weight and support of each item are considered separately for pruning the search space. The number of weighted frequent itemsets can be reduced by setting a weight range and a minimum weight, allowing the user to balance support and weight of itemsets. WFIM generates more concise and important weighted frequent itemsets in large databases, particularly dense databases with low minimum support, by adjusting a minimum weight and a weight range.
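
As a rough illustration of the general idea only (the paper's weight range and pruning conditions are more refined than this), a weighted support can scale support by the itemset's average item weight, while an upper bound using the maximum weight keeps a downward-closure-style pruning check safe. The names and definitions below are our assumptions:

```python
def support(itemset, transactions):
    """Fraction of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def weighted_support(itemset, transactions, weights):
    """Support scaled by the average weight of the itemset's items (one simple variant)."""
    avg_w = sum(weights[i] for i in itemset) / len(itemset)
    return support(itemset, transactions) * avg_w

def may_extend(itemset, transactions, weights, min_wsup):
    """support(X) * w_max bounds the weighted support of X and of every superset,
    so failing this test allows the whole branch to be pruned."""
    w_max = max(weights.values())
    return support(itemset, transactions) * w_max >= min_wsup
```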

152 citations


Book ChapterDOI
18 May 2005
TL;DR: This work proposes “Apriori-Inverse”, a method of discovering sporadic rules by ignoring all candidate itemsets above a maximum support threshold, and shows that Apriori-Inverse finds all perfectly sporadic rules much more quickly than Apriori.
Abstract: We define sporadic rules as those with low support but high confidence: for example, a rare association of two symptoms indicating a rare disease. To find such rules using the well-known Apriori algorithm, minimum support has to be set very low, producing a large number of trivial frequent itemsets. We propose “Apriori-Inverse”, a method of discovering sporadic rules by ignoring all candidate itemsets above a maximum support threshold. We define two classes of sporadic rule: perfectly sporadic rules (those that consist only of items falling below maximum support) and imperfectly sporadic rules (those that may contain items over the maximum support threshold). We show that Apriori-Inverse finds all perfectly sporadic rules much more quickly than Apriori. We also propose extensions to Apriori-Inverse to allow us to find some (but not necessarily all) imperfectly sporadic rules.
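
A minimal sketch of the inverted-threshold idea (our simplification: keep only items whose support falls below the maximum support threshold, then proceed level-wise with a small absolute count floor; the paper's definitions are more careful):

```python
from itertools import combinations

def apriori_inverse(transactions, max_support, min_abs_count=1):
    """Candidate itemsets built only from items below max_support, i.e. the
    raw material of perfectly sporadic rules (illustrative sketch)."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    support = lambda s: sum(1 for t in transactions if s <= t)
    items = {i for t in transactions for i in t}
    rare = {i for i in items if support(frozenset([i])) / n < max_support}
    current = {frozenset([i]) for i in rare if support(frozenset([i])) >= min_abs_count}
    found, k = set(), 1
    while current:
        found |= current
        keys = list(current)
        nxt = {a | b for a in keys for b in keys if len(a | b) == k + 1}
        current = {c for c in nxt if support(c) >= min_abs_count
                   and all(frozenset(s) in found for s in combinations(c, k))}
        k += 1
    return found
```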

151 citations


Proceedings ArticleDOI
27 Nov 2005
TL;DR: A one pass algorithm for frequent item set mining is presented, which has deterministic bounds on the accuracy, does not require any out-of-core summary structure, and can be easily extended to a two pass accurate algorithm.
Abstract: Frequent item set mining is a core data mining operation and has been extensively studied over the last decade. This paper takes a new approach to this problem and makes two major contributions. First, we present a one pass algorithm for frequent item set mining, which has deterministic bounds on the accuracy and does not require any out-of-core summary structure. Second, because our one pass algorithm does not produce any false negatives, it can be easily extended to a two pass accurate algorithm. Our two pass algorithm is very memory efficient, and allows mining of datasets with a large number of distinct items and/or very low support levels. Our detailed experimental evaluation on synthetic and real datasets shows the following. First, our one pass algorithm is very accurate in practice. Second, our algorithm requires significantly less memory than Manku and Motwani's one pass algorithm and the multi-pass Apriori algorithm. Our two pass algorithm outperforms Apriori and FP-tree when the number of distinct items is large and/or support levels are very low. In other cases, it is quite competitive, with the possible exception of cases where the average length of frequent item sets is quite high.

147 citations


Proceedings ArticleDOI
18 Apr 2005
TL;DR: This work introduces an efficient "systolic injection" method for intelligently reporting unpredictably generated mid-array results to a controller without any chance of collision or excessive stalling in the Apriori algorithm.
Abstract: The Apriori algorithm is a popular correlation-based data mining kernel. However, it is a computationally expensive algorithm and the running times can stretch up to days for large databases, as database sizes can extend to Gigabytes. Through the use of a new extension to the systolic array architecture, time required for processing can be significantly reduced. Our array architecture implementation on a Xilinx Virtex-II Pro 100 provides a performance improvement that can be orders of magnitude faster than the state-of-the-art software implementations. The system is easily scalable and introduces an efficient "systolic injection" method for intelligently reporting unpredictably generated mid-array results to a controller without any chance of collision or excessive stalling.

101 citations


Proceedings ArticleDOI
21 Aug 2005
TL;DR: In this paper, a trie-based APRIORI algorithm for mining frequent item sequences in a transactional database is investigated, mainly focusing on those that also arise in frequent itemset mining.
Abstract: In this paper we investigate a trie-based APRIORI algorithm for mining frequent item sequences in a transactional database. We examine the data structure, implementation and algorithmic features mainly focusing on those that also arise in frequent itemset mining. In our analysis we take into consideration modern processors' properties (memory hierarchies, prefetching, branch prediction, cache line size, etc.), in order to better understand the results of the experiments.
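
The heart of such an implementation is a trie storing the current level's candidates plus a recursive walk that counts supports; the sketch below is our simplified Python rendering of that inner loop, whose memory access pattern is what the processor-level analysis in the paper is about.

```python
class TrieNode:
    __slots__ = ("children", "count")
    def __init__(self):
        self.children = {}  # item -> TrieNode
        self.count = 0      # support counter at depth-k candidate nodes

def insert(root, itemset):
    """Store a candidate as a path of items in lexicographic order."""
    node = root
    for item in sorted(itemset):
        node = node.children.setdefault(item, TrieNode())

def count_supports(root, transaction, k):
    """Increment every depth-k candidate contained in the transaction.
    This walk is the cache- and branch-sensitive inner loop of trie-based Apriori."""
    items = sorted(transaction)
    def walk(node, start, depth):
        if depth == k:
            node.count += 1
            return
        for i in range(start, len(items)):
            child = node.children.get(items[i])
            if child is not None:
                walk(child, i + 1, depth + 1)
    walk(root, 0, 0)
```

After one pass calling count_supports for every transaction, each depth-k node's counter holds its candidate's support.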

93 citations


Book ChapterDOI
01 Jan 2005
TL;DR: This chapter attempts to survey the most successful algorithms and techniques that try to solve the frequent set mining problem efficiently.
Abstract: Frequent sets lie at the basis of many Data Mining algorithms. As a result, hundreds of algorithms have been proposed in order to solve the frequent set mining problem. In this chapter, we attempt to survey the most successful algorithms and techniques that try to solve this problem efficiently.

91 citations


01 Jan 2005
TL;DR: Basic concepts of association rule discovery are reviewed, including support, confidence, the apriori property, constraints and parallel algorithms.
Abstract: Association rules are "if-then rules" with two measures which quantify the support and confidence of the rule for a given data set. Having their origin in market basket analysis, association rules are now one of the most popular tools in data mining. This popularity is to a large part due to the availability of efficient algorithms. The first and arguably most influential algorithm for efficient association rule discovery is Apriori. In the following we will review basic concepts of association rule discovery including support, confidence, the apriori property, constraints and parallel algorithms. The core consists of a review of the most important algorithms for association rule discovery. Some familiarity with concepts like predicates, probability, expectation and random variables is assumed.
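
For concreteness, both measures follow directly from their definitions; a small worked example (data and names ours):

```python
def support(itemset, transactions):
    """supp(X): fraction of transactions containing X."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """conf(A -> B) = supp(A union B) / supp(A)."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

baskets = [frozenset(b) for b in
           ({"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}, {"milk"})]
A, B = frozenset({"bread"}), frozenset({"milk"})
print(support(A | B, baskets))    # 0.5: two of the four baskets contain both
print(confidence(A, B, baskets))  # 0.667: two of the three bread baskets also have milk
```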

82 citations


Proceedings ArticleDOI
21 Aug 2005
TL;DR: Recursive elimination is an algorithm for finding frequent item sets, which is strongly inspired by the FP-growth algorithm and very similar to the H-mine algorithm, and can be written with relatively few lines of code.
Abstract: Recursive elimination is an algorithm for finding frequent item sets, which is strongly inspired by the FP-growth algorithm and very similar to the H-mine algorithm. It does its work without prefix trees or any other complicated data structures, processing the transactions directly. Its main strength is not its speed (although it is not slow, and even outperforms Apriori and Eclat on some data sets), but the simplicity of its structure. Basically all the work is done in one simple recursive function, which can be written with relatively few lines of code.
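
Purely as an illustration of that structure (our simplified Python rendering, not the paper's C implementation), the whole miner can indeed be a single recursive function over lists of item-sorted transactions:

```python
from collections import Counter, defaultdict

def relim(transactions, min_count, order, prefix=()):
    """Recursive elimination sketch: `transactions` is a list of (count, items)
    pairs with items sorted by `order` (ascending global frequency)."""
    groups = defaultdict(list)
    for cnt, items in transactions:
        if items:
            groups[items[0]].append((cnt, items[1:]))
    found = []
    for item in sorted(order, key=order.get):  # least frequent item first
        suffixes = groups.pop(item, [])
        if not suffixes:
            continue
        support = sum(cnt for cnt, _ in suffixes)
        if support >= min_count:
            found.append((prefix + (item,), support))
            found += relim(suffixes, min_count, order, prefix + (item,))
        for cnt, rest in suffixes:  # eliminate the item: re-file its suffixes
            if rest:
                groups[rest[0]].append((cnt, rest[1:]))
    return found

def mine(db, min_count):
    freq = Counter(i for t in db for i in t)
    order = {i: r for r, (i, _) in enumerate(sorted(freq.items(), key=lambda kv: kv[1]))}
    txs = [(1, sorted((i for i in t if freq[i] >= min_count), key=order.get)) for t in db]
    return relim(txs, min_count, order)

print(mine([{"a", "b"}, {"b", "c"}, {"a", "b", "c"}], min_count=2))
```

Each round pops the group of transactions starting with the currently least frequent item, reports the item if its group is large enough, recurses on the suffixes, and then re-files them, which is the "elimination" step.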

Journal ArticleDOI
01 Mar 2005
TL;DR: This work argues that there is no need to search the entire space of candidate itemsets, but rather to let the database unveil its secrets as the customers use it, and proposes a system that acts like a search engine specifically implemented for making recommendations to the customers, using techniques borrowed from Information Retrieval.
Abstract: The classic two-stepped approach of the Apriori algorithm and its descendants, which consists of finding all large itemsets and then using these itemsets to generate all association rules, has worked well for certain categories of data. Nevertheless, for many other data types this approach shows highly degraded performance and proves rather inefficient. We argue that we need not search the entire space of candidate itemsets, but rather let the database unveil its secrets as the customers use it. We propose a system that does not merely scan all possible combinations of the itemsets, but rather acts like a search engine specifically implemented for making recommendations to the customers, using techniques borrowed from Information Retrieval.

Book ChapterDOI
23 Aug 2005
TL;DR: A new algorithm for efficiently generating large frequent candidate sets, called the Matrix Algorithm, is proposed; it generates a matrix with entries 1 or 0 by passing over the database only once, and the frequent candidate sets are then obtained from the resulting matrix.
Abstract: Finding association rules is an important data mining problem and can be derived based on mining large frequent candidate sets. In this paper, a new algorithm for efficiently generating large frequent candidate sets is proposed, which is called the Matrix Algorithm. The algorithm generates a matrix with entries 1 or 0 by passing over the database only once, and the frequent candidate sets are then obtained from the resulting matrix. Finally, association rules are mined from the frequent candidate sets. Numerical experiments and a comparison with the Apriori algorithm are made on 4 randomly generated test problems of small, medium and large sizes. The experimental results confirm that the proposed algorithm is more effective than the Apriori algorithm.
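
The idea is easy to picture with a binary transaction-by-item matrix; a brief sketch (ours, with NumPy standing in for the paper's matrix representation):

```python
import numpy as np

# Rows are transactions, columns are items; one pass over the database fills the matrix.
items = ["a", "b", "c", "d"]
db = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b", "d"}]
M = np.array([[1 if i in t else 0 for i in items] for t in db], dtype=np.uint8)

def support_count(itemset):
    """Support of an itemset = number of rows that are 1 in all of its columns."""
    cols = [items.index(i) for i in itemset]
    return int(M[:, cols].all(axis=1).sum())

print(support_count({"a", "b"}))  # 2
```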

Book ChapterDOI
29 Mar 2005
TL;DR: Simulation results reveal that the performance of the FSM algorithm is superior to that of the ZSP algorithm by two to three orders of magnitude for minimum share thresholds between 0.2% and 2%.
Abstract: Itemset share has been proposed as a measure of the importance of itemsets for mining association rules. The value of the itemset share can provide useful information such as total profit or total customer purchased quantity associated with an itemset in a database. The discovery of share-frequent itemsets does not have the downward closure property. Existing algorithms for discovering share-frequent itemsets are inefficient or do not find all share-frequent itemsets. Therefore, this study proposes a novel Fast Share Measure (FSM) algorithm to efficiently generate all share-frequent itemsets. Instead of the downward closure property, FSM satisfies the level closure property. Simulation results reveal that the performance of the FSM algorithm is superior to that of the ZSP algorithm by two to three orders of magnitude for minimum share thresholds between 0.2% and 2%.
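
As a rough sketch of the measure itself (our simplified reading; the paper's definitions of share and of the level closure property are more detailed), the share of an itemset relates the quantities of its items in supporting transactions to the total quantity in the database:

```python
def itemset_share(itemset, db):
    """Transactions map items to purchased quantities; share is the quantity
    attributable to the itemset divided by the database's total quantity."""
    total = sum(q for t in db for q in t.values())
    local = sum(t[i] for t in db if itemset <= t.keys() for i in itemset)
    return local / total

db = [{"a": 2, "b": 1}, {"a": 1, "c": 4}, {"b": 3}]
print(itemset_share({"a"}, db))  # (2 + 1) / 11
```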

Proceedings Article
01 Jan 2005
TL;DR: The concept of heavy itemset was introduced in this article, where the authors show that an itemset A is heavy (for given support and confidence values) if all possible association rules made up of items only in A are present.
Abstract: A well-known problem that limits the practical usage of association rule mining algorithms is the extremely large number of rules generated. Such a large number of rules makes the algorithms inefficient and makes it difficult for the end users to comprehend the discovered rules. We present the concept of a heavy itemset. An itemset A is heavy (for given support and confidence values) if all possible association rules made up of items only in A are present. We prove a simple necessary and sufficient condition for an itemset to be heavy. We present a formula for the number of possible rules for a given heavy itemset, and show that a heavy itemset compactly represents an exponential number of association rules. Along with two simple search algorithms, we present an efficient greedy algorithm to generate a collection of disjoint heavy itemsets in a given transaction database. We then present a modified apriori algorithm that starts with a given collection of disjoint heavy itemsets and discovers more heavy itemsets, not necessarily disjoint with the given ones.
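
A brute-force check makes the definition concrete (illustration only; the point of the paper is a simple necessary and sufficient condition that avoids exactly this enumeration):

```python
from itertools import combinations

def is_heavy(A, transactions, min_sup, min_conf):
    """True if every rule X -> Y with X, Y disjoint non-empty subsets of A
    meets both the support and the confidence thresholds."""
    A = frozenset(A)
    n = len(transactions)
    sup = lambda s: sum(1 for t in transactions if s <= t) / n
    subsets = [frozenset(c) for k in range(1, len(A)) for c in combinations(A, k)]
    for X in subsets:
        for Y in subsets:
            if X & Y:
                continue
            # with min_sup > 0 the support test short-circuits before the division
            if sup(X | Y) < min_sup or sup(X | Y) / sup(X) < min_conf:
                return False
    return True
```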

Proceedings Article
14 Jul 2005
TL;DR: The paper presents an application of the Fuzzy Frequent Pattern Tree approach to the difficult problem of the discovery of fuzzy association rules between genes from massive gene expression measurements.
Abstract: A significant data mining issue is the effective discovery of association rules. The extraction of association rules faces the problem of the combinatorial explosion of the search space, and the loss of information caused by the discretization of values. The first problem is confronted effectively by the Frequent Pattern Tree approach of [10]. This approach avoids the candidate generation phase of Apriori-like algorithms. However, the discretization of the values of the attributes (i.e. the "items") in the basic Frequent Pattern Tree approach implies a loss of information. This loss usually either deteriorates the results significantly, or renders them completely intolerable. This work extends the Frequent Pattern Tree approach appropriately into the fuzzy domain. The presented Fuzzy Frequent Pattern Tree retains the efficiency of the crisp Frequent Pattern Tree, while at the same time the careful updating of the fuzzy sets at all phases of the algorithm tries to preserve most of the original information in the data set. The paper presents an application of the Fuzzy Frequent Pattern Tree approach to the difficult problem of the discovery of fuzzy association rules between genes from massive gene expression measurements.

Book ChapterDOI
31 Aug 2005
TL;DR: A rough set based process is introduced by which a rule importance measure is calculated for association rules, in order to select the most appropriate rules; this reduces the number of rules generated and provides a measure of how important a rule is.
Abstract: Association rule algorithms often generate an excessive number of rules, many of which are not significant. It is difficult to determine which rules are more useful, interesting and important. We introduce a rough set based process by which a rule importance measure is calculated for association rules to select the most appropriate rules. We use ROSETTA software to generate multiple reducts. The Apriori association rule algorithm is then applied to generate rule sets for each data set based on each reduct. Some rules are generated more frequently than others among the total rule sets. We consider such rules as more important. We define rule importance as the frequency of an association rule among the rule sets. Rule importance is different from rule interestingness in that it does not consider predefined knowledge on what kind of information is considered to be interesting. The experimental results show our method reduces the number of rules generated and at the same time provides a measure of how important a rule is.

Book ChapterDOI
31 Jul 2005
TL;DR: An improved algorithm, based on the Apriori algorithm widely used in market basket analysis, is designed and implemented, and results indicate that this algorithm for mining association rules between behavior and personality is feasible and efficient.
Abstract: Discovering the relationship between the behavior and the personality of a learner in a web-based learning environment is key to guiding learners in the learning process. This paper proposes a new concept called personality mining to find the "deep" personality through the observed data about the behavior. First, a learner model which includes a personality model and a behavior model is proposed. Second, we have designed and implemented an improved algorithm, based on the Apriori algorithm widely used in market basket analysis, to identify the relationship. Third, we have discussed various issues like constructing the learner model, unifying the value domain of heterogeneous model attributes, and improving the Apriori algorithm with a decision domain. Experimental results indicated that this algorithm for mining association rules between behavior and personality is feasible and efficient. The algorithm has been used in a web-based learning environment developed at Xi'an Jiaotong University.

Book ChapterDOI
20 Dec 2005
TL;DR: Modifications to the Apriori algorithm are proposed to compute locally frequent sets, periodic frequent sets and periodic association rules.
Abstract: The problem of finding association rules from a dataset is to find all possible associations that hold among the items, given a minimum support and confidence. This involves finding frequent sets first and then the association rules that hold within the items in the frequent sets. In temporal datasets, as the time at which a transaction takes place is important, we may find sets of items that are frequent in certain time intervals but not frequent throughout the dataset. These frequent sets may give rise to interesting rules, but these cannot be discovered if we calculate the supports of the item sets in the usual way. We call these frequent sets locally frequent. Normally these locally frequent sets are periodic in nature. We propose modifications to the Apriori algorithm to compute locally frequent sets, periodic frequent sets and periodic association rules.

Journal ArticleDOI
TL;DR: Theoretical analysis proves the superiority of FPMFI, and experimental comparison reveals that its performance is more than twice that of a similar algorithm based on the FP-Tree.
Abstract: During the process of mining maximal frequent item sets, when the minimum support is small, superset checking is a time-consuming and frequent operation in the mining algorithm. In this paper, a new algorithm FPMFI (frequent pattern tree for maximal frequent item sets) for mining maximal frequent item sets is proposed. It adopts a new superset checking method based on the projection of the maximal frequent item sets, which efficiently reduces the cost of superset checking. In addition, FPMFI compresses the conditional FP-Tree (frequent pattern tree) greatly by deleting redundant information, which reduces the cost of accessing the tree. Theoretical analysis proves that FPMFI is superior, and experimental comparison reveals that its performance is more than twice that of a similar algorithm based on the FP-Tree.

Journal ArticleDOI
TL;DR: ExAMiner is a breadth-first algorithm that exploits the real synergy of antimonotone and monotone constraints: the total benefit is greater than the sum of the two individual benefits.
Abstract: The key point of this article is that, in frequent pattern mining, the most appropriate way of exploiting monotone constraints in conjunction with frequency is to use them to reduce the input data; this reduction in turn induces a stronger pruning of the search space of the problem. Following this intuition, we introduce ExAMiner, a breadth-first algorithm that exploits the real synergy of antimonotone and monotone constraints: the total benefit is greater than the sum of the two individual benefits. ExAMiner generalizes the basic idea of the preprocessing algorithm ExAnte (Bonchi et al., 2003b), embedding such ideas at all levels of an Apriori-like computation. The resulting algorithm is the generalization of the Apriori algorithm when a conjunction of monotone constraints is conjoined with the frequency antimonotone constraint. Experimental results confirm that this is, so far, the most efficient way of attacking the computational problem under analysis.
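
The data-reduction idea can be sketched with a concrete monotone constraint such as "total price of the items is at least min_total" (a hedged illustration of the ExAnte-style loop that ExAMiner generalizes; the names and the example constraint are our choices):

```python
from collections import Counter

def exante_reduce(db, prices, min_total, min_count):
    """Alternate monotone and antimonotone pruning until a fixpoint: each step
    shrinks the data, which strengthens the next step (the claimed synergy)."""
    db = [set(t) for t in db]
    changed = True
    while changed:
        changed = False
        # monotone reduction: a transaction whose full itemset violates the
        # constraint cannot contain any itemset satisfying it, so drop it
        kept = [t for t in db if sum(prices[i] for i in t) >= min_total]
        # antimonotone reduction: drop now-infrequent items from transactions
        freq = Counter(i for t in kept for i in t)
        new_db = [{i for i in t if freq[i] >= min_count} for t in kept]
        new_db = [t for t in new_db if t]
        if new_db != db:
            db, changed = new_db, True
    return db
```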

Proceedings ArticleDOI
27 Nov 2005
TL;DR: A thorough experimental study of datasets with respect to frequent item sets, exhibiting a new characterization of datasets and some invariants that allow better prediction of the behavior of well-known algorithms.
Abstract: The discovery of frequent patterns is a famous problem in data mining. While plenty of algorithms have been proposed during the last decade, only a few contributions have tried to understand the influence of datasets on the algorithms' behavior. Being able to explain why certain algorithms are likely to perform very well or very poorly on some datasets is still an open question. In this setting, we describe a thorough experimental study of datasets with respect to frequent item sets. We study the distribution of frequent item sets with respect to item set size, together with the distribution of three concise representations: frequent closed, frequent free and frequent essential item sets. For each of them, we also study the distribution of their positive and negative borders whenever possible. From this analysis, we exhibit a new characterization of datasets and some invariants that allow us to better predict the behavior of well-known algorithms. The main perspective of this work is to devise algorithms that adapt to dataset characteristics.

Book ChapterDOI
22 Jul 2005
TL;DR: The results of the experiments prove that the FTSC and FTSHC algorithms are more efficient than the K-Means algorithm in clustering performance, and provide an understandable description of the discovered clusters by the frequent term sets.
Abstract: In this paper, a new text-clustering algorithm named Frequent Term Set-based Clustering (FTSC) is introduced. It uses frequent term sets to cluster texts. First, it extracts useful information from documents and inserts it into databases. Then, it uses the Apriori algorithm, based on association rules mining, to efficiently discover the frequent item sets. Finally, it clusters the documents according to the frequent words in subsets of the frequent term sets. This algorithm can reduce the dimension of the text data efficiently for very large databases, thus improving the accuracy and speed of the clustering algorithm. The results of clustering texts by the FTSC algorithm cannot reflect the overlap of text classes. Based on the FTSC algorithm, an improved algorithm, the Frequent Term Set-based Hierarchical Clustering (FTSHC) algorithm, is given. This algorithm can determine the overlap of text classes by the overlap of the frequent word sets, and provides an understandable description of the discovered clusters by the frequent term sets. The FTSC, FTSHC and K-Means algorithms are evaluated quantitatively by experiments. The results of the experiments prove that the FTSC and FTSHC algorithms are more efficient than the K-Means algorithm in clustering performance.

01 Jan 2005
TL;DR: It is demonstrated that association rules that show the correlation relationships between user navigation patterns and web pages they find interesting can be transformed into collaborative filtering data, and the weighted averages scheme more accurately computes predictions of user interests than the simple averages scheme does.
Abstract: This dissertation is a simulation study of factors and techniques involved in designing hyperlink recommender systems that recommend to users, web pages that past users with similar navigation behaviors found interesting. The methodology involves identification of pertinent factors or techniques, and for each one, addresses the following questions: (a) room for improvement; (b) better approach, if any; and (c) performance characteristics of the technique in environments that hyperlink recommender systems operate in. The following four problems are addressed: Web page classification. A new metric (PageRank × Inverse Links-to-Word count ratio) is proposed for classifying web pages as content or navigation, to help in the discovery of user navigation behaviors from web user access logs. Results of a small user study suggest that this metric leads to desirable results. Data mining. A new apriori algorithm for mining association rules from large databases is proposed. The new algorithm addresses the problem of scaling of the classical apriori algorithm by eliminating an expensive join step, and applying the apriori property to every row of the database. In this study, association rules show the correlation relationships between user navigation behaviors and web pages they find interesting. The new algorithm has better space complexity than the classical one, and better time efficiency under some conditions and comparable time efficiency under other conditions. Prediction models for user interests. We demonstrate that association rules that show the correlation relationships between user navigation patterns and web pages they find interesting can be transformed into collaborative filtering data. We investigate collaborative filtering prediction models based on two approaches for computing prediction scores: using simple averages and weighted averages. Our findings suggest that the weighted averages scheme more accurately computes predictions of user interests than the simple averages scheme does. Clustering. Clustering techniques are frequently applied in the design of personalization systems. We studied the performance of the CLARANS clustering algorithm in high dimensional space in relation to the PAM and CLARA clustering algorithms. While CLARA had the best time performance, CLARANS resulted in clusters with the lowest intra-cluster dissimilarities, and so was most effective in this regard.

Proceedings ArticleDOI
15 Dec 2005
TL;DR: This paper describes an interactive graphical user interface tool that can be used to study two famous frequent itemset generation algorithms, namely, Apriori and Eclat, and provides a hands-on environment for doing so.
Abstract: This paper describes an interactive graphical user interface tool called Visual Apriori that can be used to study two famous frequent itemset generation algorithms, namely, Apriori and Eclat. Understanding the functional behavior of these two algorithms is critical for students taking a data mining course; and Visual Apriori provides a hands-on environment for doing so. Visual Apriori relies on active participation from the user, where one inputs a transactional database and the tool produces a tree-based frequent itemset generation animation for the algorithm chosen. Visual Apriori provides an effortless learning experience by featuring user-friendly and easy to understand controls.

Book ChapterDOI
09 Jul 2005
TL;DR: It is shown that AprioriSetsAndSequences produces rules that express significant temporal relationships that describe patterns of behavior observed in the data set, and inherently handles different levels of time granularity in the same data set.
Abstract: We introduce an algorithm for mining expressive temporal relationships from complex data. Our algorithm, AprioriSetsAndSequences (ASAS), extends the Apriori algorithm to data sets in which a single data instance may consist of a combination of attribute values that are nominal sequences, time series, sets, and traditional relational values. Data sets of this type occur naturally in many domains including health care, financial analysis, complex system diagnostics, and domains in which multi-sensors are used. AprioriSetsAndSequences identifies predefined events of interest in the sequential data attributes. It then mines for association rules that make explicit all frequent temporal relationships among the occurrences of those events and relationships of those events and other data attributes. Our algorithm inherently handles different levels of time granularity in the same data set. We have implemented AprioriSetsAndSequences within the Weka environment [1] and have applied it to computer performance, stock market, and clinical sleep disorder data. We show that AprioriSetsAndSequences produces rules that express significant temporal relationships that describe patterns of behavior observed in the data set.

Book ChapterDOI
03 Oct 2005
TL;DR: It is claimed that when a proper item order is used, complete pruning does not necessarily speed up Apriori, and in databases with certain characteristics pruning increases run time significantly; if complete pruning is applied, then an intersection-based technique not only results in a faster algorithm, but yields closed-itemset selection for free in terms of both memory consumption and run-time.
Abstract: In this paper we investigate the relationship between closed itemset mining, the complete pruning technique and item ordering in the Apriori algorithm. We claim that when a proper item order is used, complete pruning does not necessarily speed up Apriori, and in databases with certain characteristics, pruning increases run time significantly. We also show that if complete pruning is applied, then an intersection-based technique not only results in a faster algorithm, but closed-itemset selection comes for free in terms of both memory consumption and run-time.
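
The pruning step under discussion is the standard full-subset test; written out (sketch, names ours), it is also easy to see where the cost goes: k + 1 subset lookups per (k+1)-candidate, paid whether or not the candidate is ever removed.

```python
from itertools import combinations

def complete_prune(candidate, frequent_prev):
    """Complete pruning: keep a (k+1)-candidate only if all of its k-subsets
    are in the frequent k-itemsets of the previous level."""
    k = len(candidate) - 1
    return all(frozenset(s) in frequent_prev for s in combinations(candidate, k))
```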

Proceedings Article
13 Feb 2005
TL;DR: Two novel methods, the Bitmap-based GSP (BGSP) and the SM-Tree (State Machine-Tree) algorithms, are presented as enhancements of the GSP-based sequential pattern mining approach.
Abstract: Sequential pattern mining is a heavily researched area in the field of data mining with a wide variety of applications. The task of discovering frequent sequences is challenging, because the algorithm needs to process a combinatorially explosive number of possible sequences. Most of the methods dealing with the sequential pattern mining problem are based on the approach of the traditional task of itemset mining, because the former can be interpreted as a generalization of the latter. Several algorithms use a level-wise "candidate generate and test" approach, while others use projected databases to discover the frequent sequences. In this paper a classification of the well-known sequence mining algorithms is presented. Because each algorithm has its own advantages and drawbacks regarding execution time and memory requirements, and the exact aims of the algorithms differ as well, an exact ranking of the methods is omitted. A basic level-wise algorithm, GSP, is described in detail. Because level-wise algorithms in general need less memory than projection-based ones, an efficient implementation of the GSP algorithm is also suggested. Two novel methods, the Bitmap-based GSP (BGSP) and the SM-Tree (State Machine-Tree) algorithms, are presented as enhancements of the GSP-based sequential pattern mining approach.
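
The "candidate generate and test" core of GSP rests on an ordered containment check between a candidate sequence and a data sequence; a minimal sketch of that test and of support counting (ours, using greedy matching, with each sequence element an itemset):

```python
def is_subsequence(pattern, sequence):
    """True if the pattern's elements occur in order, each contained in some
    later element of the data sequence (greedy left-to-right matching)."""
    pos = 0
    for element in sequence:
        if pos < len(pattern) and pattern[pos] <= element:
            pos += 1
    return pos == len(pattern)

def sequence_support(pattern, db):
    """Number of data sequences containing the candidate pattern."""
    return sum(is_subsequence(pattern, seq) for seq in db)

db = [[{"a"}, {"b", "c"}, {"d"}], [{"a"}, {"c"}], [{"b"}, {"d"}]]
print(sequence_support([{"a"}, {"c"}], db))  # 2
```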

Proceedings ArticleDOI
07 Nov 2005
TL;DR: This paper presents a novel algorithm for mining frequent itemsets in sparse databases; compared with existing algorithms it has a visible advantage, and the experimental results show that it is very promising.
Abstract: This paper presents a novel algorithm for mining frequent itemsets in sparse databases; compared with existing algorithms it has a visible advantage. The algorithm requires fewer scans of the transaction database: only one for small and medium transaction databases, and not more than two for large ones. During the scan, the transaction items are stored as triplets, and the count of every transaction item is kept in a one-dimensional array, so that the frequent itemsets can be generated in memory. I/O cost is thus greatly reduced. The experimental results show that our algorithm is very promising.

Book ChapterDOI
Nevcihan Duru
14 Sep 2005
TL;DR: A software tool (DMAP) that uses the Apriori algorithm, usually employed for market basket analysis, was used to analyze a diabetic database.
Abstract: In recent days, mining information from large databases has been recognized by many researchers, and many data mining techniques and systems have been developed. In this study, a software tool (DMAP), which uses the Apriori algorithm, was developed. Apriori is an influential algorithm used in data mining. The name of the algorithm is based on the fact that the algorithm uses prior knowledge of frequent item set properties. The software is used for discovering the social status of diabetics. A diabetic database that belongs to the Faculty of Medicine of Kocaeli University has been used. The software was executed on a database which has records of 66 patients for test purposes. In the literature, diabetic databases have often been analyzed by rough sets. In this paper, the Apriori algorithm, which has usually been used for market basket analysis, was used for analyzing a diabetic database.