
Showing papers on "Apriori algorithm" published in 2005


Journal ArticleDOI
TL;DR: The R package arules presented in this paper provides a basic infrastructure for creating and manipulating input data sets and for analyzing the resulting itemsets and rules.
Abstract: Mining frequent itemsets and association rules is a popular and well researched approach for discovering interesting relationships between variables in large databases. The R package arules presented in this paper provides a basic infrastructure for creating and manipulating input data sets and for analyzing the resulting itemsets and rules. The package also includes interfaces to two fast mining algorithms, the popular C implementations of Apriori and Eclat by Christian Borgelt. These algorithms can be used to mine frequent itemsets, maximal frequent itemsets, closed frequent itemsets and association rules.
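
To make the level-wise search concrete, here is a minimal Python sketch of Apriori-style frequent itemset mining. All names are ours for illustration; the package itself calls Borgelt's C implementations from R.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent itemset mining (illustrative sketch)."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    current = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    k = 1
    while current:
        # count candidate supports with one pass over the transactions
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # join step: combine frequent k-itemsets into (k+1)-candidates
        keys = list(level)
        current = {a | b for a in keys for b in keys if len(a | b) == k + 1}
        # prune step (the apriori property): every k-subset must be frequent
        current = {c for c in current
                   if all(frozenset(s) in level for s in combinations(c, k))}
        k += 1
    return frequent

baskets = [{"milk", "bread"}, {"milk", "beer"}, {"milk", "bread", "beer"}, {"bread"}]
print(apriori(baskets, min_support=0.5))
```

Eclat differs mainly in traversing the search space depth-first over transaction-id lists rather than level-wise over the database.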

470 citations


Proceedings ArticleDOI
21 Aug 2005
TL;DR: This paper describes a C implementation of the FP-growth algorithm, which contains two variants of the core operation of computing a projection of an FP-tree (the fundamental data structure of the FP-growth algorithm), and reports experimental results comparing this implementation with three other frequent item set mining algorithms I implemented.
Abstract: The FP-growth algorithm is currently one of the fastest approaches to frequent item set mining. In this paper I describe a C implementation of this algorithm, which contains two variants of the core operation of computing a projection of an FP-tree (the fundamental data structure of the FP-growth algorithm). In addition, projected FP-trees are (optionally) pruned by removing items that have become infrequent due to the projection (an approach that has been called FP-Bonsai). I report experimental results comparing this implementation of the FP-growth algorithm with three other frequent item set mining algorithms I implemented (Apriori, Eclat, and Relim).
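
Purely to illustrate the data structure the paper is about, the sketch below (our simplified Python, not the paper's C code) builds an FP-tree and extracts the conditional pattern base from which a projection is built.

```python
from collections import Counter

class FPNode:
    """FP-tree node: item, count, parent link, children keyed by item."""
    def __init__(self, item, parent):
        self.item, self.count, self.parent, self.children = item, 0, parent, {}

def build_fptree(transactions, min_count):
    # order items by descending global frequency and drop infrequent ones
    freq = Counter(i for t in transactions for i in t)
    order = {i: r for r, (i, c) in enumerate(freq.most_common()) if c >= min_count}
    root, links = FPNode(None, None), {}
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in order), key=order.get):
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = FPNode(item, node)
                links.setdefault(item, []).append(child)
            child.count += 1
            node = child
    return root, links

def conditional_pattern_base(links, item):
    """Prefix paths leading to `item`, each weighted by the item node's count;
    a projection builds the conditional FP-tree from exactly these paths."""
    base = []
    for node in links.get(item, []):
        path, p = [], node.parent
        while p is not None and p.item is not None:
            path.append(p.item)
            p = p.parent
        if path:
            base.append((list(reversed(path)), node.count))
    return base

tree, links = build_fptree([{"a", "b"}, {"b", "c"}, {"a", "b", "c"}], min_count=2)
print(conditional_pattern_base(links, "c"))  # [(['b'], 1), (['b', 'a'], 1)]
```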

295 citations


Journal ArticleDOI
01 Jan 2005
TL;DR: Using a semantics based on the closure of the Galois connection, a condensed representation for association rules is defined; it contains the non-redundant association rules having minimal antecedents and maximal consequents, called min-max association rules.
Abstract: Association rule extraction from operational datasets often produces several tens of thousands, and even millions, of association rules. Moreover, many of these rules are redundant and thus useless. Using a semantics based on the closure of the Galois connection, we define a condensed representation for association rules. This representation is characterized by frequent closed itemsets and their generators. It contains the non-redundant association rules having minimal antecedents and maximal consequents, called min-max association rules. We think that these rules are the most relevant since they are the most general non-redundant association rules. Furthermore, this representation is a basis, i.e., a generating set for all association rules, their supports and their confidences, and all of them can be retrieved without accessing the data. We introduce algorithms for extracting this basis and for reconstructing all association rules. Results of experiments carried out on real datasets show the usefulness of this approach. In order to generate this basis when an algorithm for extracting frequent itemsets--such as Apriori for instance--is used, we also present an algorithm for deriving frequent closed itemsets and their generators from frequent itemsets without using the dataset.
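
The closure operator of the Galois connection has a direct set-theoretic reading: the closure of an itemset is the intersection of all transactions containing it, and a generator is a minimal itemset having a given closed itemset as its closure. A small sketch (names ours):

```python
def closure(itemset, transactions):
    """Galois closure: intersection of all transactions that contain the itemset."""
    covering = [t for t in transactions if itemset <= t]
    return frozenset.intersection(*covering) if covering else itemset

db = [frozenset(t) for t in ({"a", "b", "c"}, {"a", "b"}, {"b", "c"})]
# Every transaction containing 'a' also contains 'b', so closure({'a'}) = {'a', 'b'}
# and {'a'} is a generator of the closed itemset {'a', 'b'}; the min-max rule it
# yields is a -> b.
print(closure(frozenset({"a"}), db))
```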

206 citations


Proceedings Article
01 Jan 2005
TL;DR: WFIM generates more concise and important weighted frequent itemsets in large databases, particularly dense databases with low minimum support, by adjusting a minimum weight and a weight range.
Abstract: Researchers have proposed weighted frequent itemset mining algorithms that reflect the importance of items. The main focus of weighted frequent itemset mining concerns satisfying the downward closure property. All weighted association rule mining algorithms suggested so far have been based on the Apriori algorithm. However, pattern growth algorithms are more efficient than Apriori based algorithms. Our main approach is to push the weight constraints into the pattern growth algorithm while maintaining the downward closure property. In this paper, a weight range and a minimum weight constraint are defined and items are given different weights within the weight range. The weight and support of each item are considered separately for pruning the search space. The number of weighted frequent itemsets can be reduced by setting a weight range and a minimum weight, allowing the user to balance support and weight of itemsets. WFIM generates more concise and important weighted frequent itemsets in large databases, particularly dense databases with low minimum support, by adjusting a minimum weight and a weight range.
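
As a rough illustration of the general idea only (the paper's weight range and pruning conditions are more refined than this), a weighted support can scale support by the itemset's average item weight, while an upper bound using the maximum weight keeps a downward-closure-style pruning check safe. The names and definitions below are our assumptions:

```python
def support(itemset, transactions):
    """Fraction of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def weighted_support(itemset, transactions, weights):
    """Support scaled by the average weight of the itemset's items (one simple variant)."""
    avg_w = sum(weights[i] for i in itemset) / len(itemset)
    return support(itemset, transactions) * avg_w

def may_extend(itemset, transactions, weights, min_wsup):
    """support(X) * w_max bounds the weighted support of X and of every superset,
    so failing this test allows the whole branch to be pruned."""
    w_max = max(weights.values())
    return support(itemset, transactions) * w_max >= min_wsup
```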

152 citations


Book ChapterDOI
18 May 2005
TL;DR: This work proposes “Apriori-Inverse”, a method of discovering sporadic rules by ignoring all candidate itemsets above a maximum support threshold, and shows that Apriori-Inverse finds all perfectly sporadic rules much more quickly than Apriori.
Abstract: We define sporadic rules as those with low support but high confidence: for example, a rare association of two symptoms indicating a rare disease. To find such rules using the well-known Apriori algorithm, minimum support has to be set very low, producing a large number of trivial frequent itemsets. We propose “Apriori-Inverse”, a method of discovering sporadic rules by ignoring all candidate itemsets above a maximum support threshold. We define two classes of sporadic rule: perfectly sporadic rules (those that consist only of items falling below maximum support) and imperfectly sporadic rules (those that may contain items over the maximum support threshold). We show that Apriori-Inverse finds all perfectly sporadic rules much more quickly than Apriori. We also propose extensions to Apriori-Inverse to allow us to find some (but not necessarily all) imperfectly sporadic rules.
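
A minimal sketch of the inverted-threshold idea (our simplification: keep only items whose support falls below the maximum support threshold, then proceed level-wise with a small absolute count floor; the paper's definitions are more careful):

```python
from itertools import combinations

def apriori_inverse(transactions, max_support, min_abs_count=1):
    """Candidate itemsets built only from items below max_support, i.e. the
    raw material of perfectly sporadic rules (illustrative sketch)."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    support = lambda s: sum(1 for t in transactions if s <= t)
    items = {i for t in transactions for i in t}
    rare = {i for i in items if support(frozenset([i])) / n < max_support}
    current = {frozenset([i]) for i in rare if support(frozenset([i])) >= min_abs_count}
    found, k = set(), 1
    while current:
        found |= current
        keys = list(current)
        nxt = {a | b for a in keys for b in keys if len(a | b) == k + 1}
        current = {c for c in nxt if support(c) >= min_abs_count
                   and all(frozenset(s) in found for s in combinations(c, k))}
        k += 1
    return found
```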

151 citations


Proceedings ArticleDOI
27 Nov 2005
TL;DR: A one pass algorithm for frequent item set mining is presented, which has deterministic bounds on the accuracy, does not require any out-of-core summary structure, and can be easily extended to a two pass accurate algorithm.
Abstract: Frequent item set mining is a core data mining operation and has been extensively studied over the last decade. This paper takes a new approach to this problem and makes two major contributions. First, we present a one pass algorithm for frequent item set mining, which has deterministic bounds on the accuracy and does not require any out-of-core summary structure. Second, because our one pass algorithm does not produce any false negatives, it can be easily extended to a two pass accurate algorithm. Our two pass algorithm is very memory efficient, and allows mining of datasets with a large number of distinct items and/or very low support levels. Our detailed experimental evaluation on synthetic and real datasets shows the following. First, our one pass algorithm is very accurate in practice. Second, our algorithm requires significantly less memory than Manku and Motwani's one pass algorithm and the multi-pass Apriori algorithm. Our two pass algorithm outperforms Apriori and FP-tree when the number of distinct items is large and/or support levels are very low. In other cases, it is quite competitive, with the possible exception of cases where the average length of frequent item sets is quite high.

147 citations


Proceedings ArticleDOI
18 Apr 2005
TL;DR: This work introduces an efficient "systolic injection" method for intelligently reporting unpredictably generated mid-array results to a controller without any chance of collision or excessive stalling in the Apriori algorithm.
Abstract: The Apriori algorithm is a popular correlation-based data mining kernel. However, it is a computationally expensive algorithm and the running times can stretch up to days for large databases, as database sizes can extend to Gigabytes. Through the use of a new extension to the systolic array architecture, time required for processing can be significantly reduced. Our array architecture implementation on a Xilinx Virtex-II Pro 100 provides a performance improvement that can be orders of magnitude faster than the state-of-the-art software implementations. The system is easily scalable and introduces an efficient "systolic injection" method for intelligently reporting unpredictably generated mid-array results to a controller without any chance of collision or excessive stalling.

101 citations


Proceedings ArticleDOI
21 Aug 2005
TL;DR: In this paper, a trie-based APRIORI algorithm for mining frequent item sequences in a transactional database is investigated, mainly focusing on those that also arise in frequent itemset mining.
Abstract: In this paper we investigate a trie-based APRIORI algorithm for mining frequent item sequences in a transactional database. We examine the data structure, implementation and algorithmic features mainly focusing on those that also arise in frequent itemset mining. In our analysis we take into consideration modern processors' properties (memory hierarchies, prefetching, branch prediction, cache line size, etc.), in order to better understand the results of the experiments.
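
The heart of such an implementation is a trie storing the current level's candidates plus a recursive walk that counts supports; the sketch below is our simplified Python rendering of that inner loop, whose memory access pattern is what the processor-level analysis in the paper is about.

```python
class TrieNode:
    __slots__ = ("children", "count")
    def __init__(self):
        self.children = {}  # item -> TrieNode
        self.count = 0      # support counter at depth-k candidate nodes

def insert(root, itemset):
    """Store a candidate as a path of items in lexicographic order."""
    node = root
    for item in sorted(itemset):
        node = node.children.setdefault(item, TrieNode())

def count_supports(root, transaction, k):
    """Increment every depth-k candidate contained in the transaction.
    This walk is the cache- and branch-sensitive inner loop of trie-based Apriori."""
    items = sorted(transaction)
    def walk(node, start, depth):
        if depth == k:
            node.count += 1
            return
        for i in range(start, len(items)):
            child = node.children.get(items[i])
            if child is not None:
                walk(child, i + 1, depth + 1)
    walk(root, 0, 0)
```

After one pass calling count_supports for every transaction, each depth-k node's counter holds its candidate's support.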

93 citations


Book ChapterDOI
01 Jan 2005
TL;DR: This chapter attempts to survey the most successful algorithms and techniques that try to solve the frequent set mining problem efficiently.
Abstract: Frequent sets lie at the basis of many Data Mining algorithms. As a result, hundreds of algorithms have been proposed in order to solve the frequent set mining problem. In this chapter, we attempt to survey the most successful algorithms and techniques that try to solve this problem efficiently.

91 citations


01 Jan 2005
TL;DR: Basic concepts of association rule discovery are reviewed, including support, confidence, the apriori property, constraints and parallel algorithms.
Abstract: Association rules are "if-then rules" with two measures which quantify the support and confidence of the rule for a given data set. Having their origin in market basket analysis, association rules are now one of the most popular tools in data mining. This popularity is to a large part due to the availability of efficient algorithms. The first and arguably most influential algorithm for efficient association rule discovery is Apriori. In the following we will review basic concepts of association rule discovery including support, confidence, the apriori property, constraints and parallel algorithms. The core consists of a review of the most important algorithms for association rule discovery. Some familiarity with concepts like predicates, probability, expectation and random variables is assumed.
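
For concreteness, both measures follow directly from their definitions; a small worked example (data and names ours):

```python
def support(itemset, transactions):
    """supp(X): fraction of transactions containing X."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """conf(A -> B) = supp(A union B) / supp(A)."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

baskets = [frozenset(b) for b in
           ({"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}, {"milk"})]
A, B = frozenset({"bread"}), frozenset({"milk"})
print(support(A | B, baskets))    # 0.5: two of the four baskets contain both
print(confidence(A, B, baskets))  # 0.667: two of the three bread baskets also have milk
```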

82 citations


Proceedings ArticleDOI
21 Aug 2005
TL;DR: Recursive elimination is an algorithm for finding frequent item sets, which is strongly inspired by the FP-growth algorithm and very similar to the H-mine algorithm, and can be written with relatively few lines of code.
Abstract: Recursive elimination is an algorithm for finding frequent item sets, which is strongly inspired by the FP-growth algorithm and very similar to the H-mine algorithm. It does its work without prefix trees or any other complicated data structures, processing the transactions directly. Its main strength is not its speed (although it is not slow, and even outperforms Apriori and Eclat on some data sets), but the simplicity of its structure. Basically all the work is done in one simple recursive function, which can be written with relatively few lines of code.
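
Purely as an illustration of that structure (our simplified Python rendering, not the paper's C implementation), the whole miner can indeed be a single recursive function over lists of item-sorted transactions:

```python
from collections import Counter, defaultdict

def relim(transactions, min_count, order, prefix=()):
    """Recursive elimination sketch: `transactions` is a list of (count, items)
    pairs with items sorted by `order` (ascending global frequency)."""
    groups = defaultdict(list)
    for cnt, items in transactions:
        if items:
            groups[items[0]].append((cnt, items[1:]))
    found = []
    for item in sorted(order, key=order.get):  # least frequent item first
        suffixes = groups.pop(item, [])
        if not suffixes:
            continue
        support = sum(cnt for cnt, _ in suffixes)
        if support >= min_count:
            found.append((prefix + (item,), support))
            found += relim(suffixes, min_count, order, prefix + (item,))
        for cnt, rest in suffixes:  # eliminate the item: re-file its suffixes
            if rest:
                groups[rest[0]].append((cnt, rest[1:]))
    return found

def mine(db, min_count):
    freq = Counter(i for t in db for i in t)
    order = {i: r for r, (i, _) in enumerate(sorted(freq.items(), key=lambda kv: kv[1]))}
    txs = [(1, sorted((i for i in t if freq[i] >= min_count), key=order.get)) for t in db]
    return relim(txs, min_count, order)

print(mine([{"a", "b"}, {"b", "c"}, {"a", "b", "c"}], min_count=2))
```

Each round pops the group of transactions starting with the currently least frequent item, reports the item if its group is large enough, recurses on the suffixes, and then re-files them, which is the "elimination" step.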

Journal ArticleDOI
01 Mar 2005
TL;DR: This work argues that there is no need to search the entire space of candidate itemsets, but rather to let the database unveil its secrets as the customers use it, and proposes a system that acts like a search engine specifically implemented for making recommendations to the customers, using techniques borrowed from Information Retrieval.
Abstract: The classic two-stepped approach of the Apriori algorithm and its descendants, which consists of finding all large itemsets and then using these itemsets to generate all association rules, has worked well for certain categories of data. Nevertheless, for many other data types this approach shows highly degraded performance and proves rather inefficient. We argue that we need not search the entire space of candidate itemsets, but rather let the database unveil its secrets as the customers use it. We propose a system that does not merely scan all possible combinations of the itemsets, but rather acts like a search engine specifically implemented for making recommendations to the customers, using techniques borrowed from Information Retrieval.

Book ChapterDOI
23 Aug 2005
TL;DR: A new algorithm for efficiently generating large frequent candidate sets, called the Matrix Algorithm, is proposed; it generates a matrix with entries 1 or 0 by passing over the database only once, and the frequent candidate sets are then obtained from the resulting matrix.
Abstract: Finding association rules is an important data mining problem and can be derived based on mining large frequent candidate sets. In this paper, a new algorithm for efficiently generating large frequent candidate sets is proposed, which is called the Matrix Algorithm. The algorithm generates a matrix with entries 1 or 0 by passing over the database only once, and the frequent candidate sets are then obtained from the resulting matrix. Finally, association rules are mined from the frequent candidate sets. Numerical experiments and a comparison with the Apriori algorithm are made on 4 randomly generated test problems of small, medium and large sizes. The experimental results confirm that the proposed algorithm is more effective than the Apriori algorithm.
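
The idea is easy to picture with a binary transaction-by-item matrix; a brief sketch (ours, with NumPy standing in for the paper's matrix representation):

```python
import numpy as np

# Rows are transactions, columns are items; one pass over the database fills the matrix.
items = ["a", "b", "c", "d"]
db = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b", "d"}]
M = np.array([[1 if i in t else 0 for i in items] for t in db], dtype=np.uint8)

def support_count(itemset):
    """Support of an itemset = number of rows that are 1 in all of its columns."""
    cols = [items.index(i) for i in itemset]
    return int(M[:, cols].all(axis=1).sum())

print(support_count({"a", "b"}))  # 2
```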

Book ChapterDOI
29 Mar 2005
TL;DR: Simulation results reveal that the performance of the FSM algorithm is superior to that of the ZSP algorithm by two to three orders of magnitude for minimum share thresholds between 0.2% and 2%.
Abstract: Itemset share has been proposed as a measure of the importance of itemsets for mining association rules. The value of the itemset share can provide useful information such as total profit or total customer purchased quantity associated with an itemset in a database. The discovery of share-frequent itemsets does not have the downward closure property. Existing algorithms for discovering share-frequent itemsets are inefficient or do not find all share-frequent itemsets. Therefore, this study proposes a novel Fast Share Measure (FSM) algorithm to efficiently generate all share-frequent itemsets. Instead of the downward closure property, FSM satisfies the level closure property. Simulation results reveal that the performance of the FSM algorithm is superior to that of the ZSP algorithm by two to three orders of magnitude for minimum share thresholds between 0.2% and 2%.
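
As a rough sketch of the measure itself (our simplified reading; the paper's definitions of share and of the level closure property are more detailed), the share of an itemset relates the quantities of its items in supporting transactions to the total quantity in the database:

```python
def itemset_share(itemset, db):
    """Transactions map items to purchased quantities; share is the quantity
    attributable to the itemset divided by the database's total quantity."""
    total = sum(q for t in db for q in t.values())
    local = sum(t[i] for t in db if itemset <= t.keys() for i in itemset)
    return local / total

db = [{"a": 2, "b": 1}, {"a": 1, "c": 4}, {"b": 3}]
print(itemset_share({"a"}, db))  # (2 + 1) / 11
```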

Proceedings Article
01 Jan 2005
TL;DR: The concept of heavy itemset was introduced in this article, where the authors show that an itemset A is heavy (for given support and confidence values) if all possible association rules made up of items only in A are present.
Abstract: A well-known problem that limits the practical usage of association rule mining algorithms is the extremely large number of rules generated. Such a large number of rules makes the algorithms inefficient and makes it difficult for the end users to comprehend the discovered rules. We present the concept of a heavy itemset. An itemset A is heavy (for given support and confidence values) if all possible association rules made up of items only in A are present. We prove a simple necessary and sufficient condition for an itemset to be heavy. We present a formula for the number of possible rules for a given heavy itemset, and show that a heavy itemset compactly represents an exponential number of association rules. Along with two simple search algorithms, we present an efficient greedy algorithm to generate a collection of disjoint heavy itemsets in a given transaction database. We then present a modified apriori algorithm that starts with a given collection of disjoint heavy itemsets and discovers more heavy itemsets, not necessarily disjoint with the given ones.
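
A brute-force check makes the definition concrete (illustration only; the point of the paper is a simple necessary and sufficient condition that avoids exactly this enumeration):

```python
from itertools import combinations

def is_heavy(A, transactions, min_sup, min_conf):
    """True if every rule X -> Y with X, Y disjoint non-empty subsets of A
    meets both the support and the confidence thresholds."""
    A = frozenset(A)
    n = len(transactions)
    sup = lambda s: sum(1 for t in transactions if s <= t) / n
    subsets = [frozenset(c) for k in range(1, len(A)) for c in combinations(A, k)]
    for X in subsets:
        for Y in subsets:
            if X & Y:
                continue
            # with min_sup > 0 the support test short-circuits before the division
            if sup(X | Y) < min_sup or sup(X | Y) / sup(X) < min_conf:
                return False
    return True
```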

Proceedings Article
14 Jul 2005
TL;DR: The paper presents an application of the Fuzzy Frequent Pattern Tree approach to the difficult problem of the discovery of fuzzy association rules between genes from massive gene expression measurements.
Abstract: A significant data mining issue is the effective discovery of association rules. The extraction of association rules faces the problem of the combinatorial explosion of the search space, and the loss of information caused by the discretization of values. The first problem is confronted effectively by the Frequent Pattern Tree approach of [10]. This approach avoids the candidate generation phase of Apriori-like algorithms. However, the discretization of the values of the attributes (i.e. the "items") in the basic Frequent Pattern Tree approach implies a loss of information. This loss usually either deteriorates the results significantly, or renders them completely intolerable. This work extends the Frequent Pattern Tree approach appropriately into the fuzzy domain. The presented Fuzzy Frequent Pattern Tree retains the efficiency of the crisp Frequent Pattern Tree, while at the same time the careful updating of the fuzzy sets at all phases of the algorithm tries to preserve most of the original information in the data set. The paper presents an application of the Fuzzy Frequent Pattern Tree approach to the difficult problem of the discovery of fuzzy association rules between genes from massive gene expression measurements.

Book ChapterDOI
31 Aug 2005
TL;DR: A rough set based process is introduced by which a rule importance measure is calculated for association rules, in order to select the most appropriate rules; this reduces the number of rules generated and provides a measure of how important a rule is.
Abstract: Association rule algorithms often generate an excessive number of rules, many of which are not significant. It is difficult to determine which rules are more useful, interesting and important. We introduce a rough set based process by which a rule importance measure is calculated for association rules to select the most appropriate rules. We use ROSETTA software to generate multiple reducts. The Apriori association rule algorithm is then applied to generate rule sets for each data set based on each reduct. Some rules are generated more frequently than others among the total rule sets. We consider such rules as more important. We define rule importance as the frequency of an association rule among the rule sets. Rule importance is different from rule interestingness in that it does not consider predefined knowledge on what kind of information is considered to be interesting. The experimental results show our method reduces the number of rules generated and at the same time provides a measure of how important a rule is.

Book ChapterDOI
31 Jul 2005
TL;DR: An improved algorithm, based on the Apriori algorithm widely used in market basket analysis, is designed and implemented, and results indicate that this algorithm for mining association rules between behavior and personality is feasible and efficient.
Abstract: Discovering the relationship between the behavior and the personality of a learner in a web-based learning environment is key to guiding learners in the learning process. This paper proposes a new concept called personality mining to find the "deep" personality through the observed data about the behavior. First, a learner model which includes a personality model and a behavior model is proposed. Second, we have designed and implemented an improved algorithm, based on the Apriori algorithm widely used in market basket analysis, to identify the relationship. Third, we have discussed various issues like constructing the learner model, unifying the value domain of heterogeneous model attributes, and improving the Apriori algorithm with a decision domain. Experimental results indicated that this algorithm for mining association rules between behavior and personality is feasible and efficient. The algorithm has been used in a web-based learning environment developed at Xi'an Jiaotong University.

Book ChapterDOI
20 Dec 2005
TL;DR: Modifications to the Apriori algorithm are proposed to compute locally frequent sets, periodic frequent sets and periodic association rules.
Abstract: The problem of finding association rules from a dataset is to find all possible associations that hold among the items, given a minimum support and confidence. This involves finding frequent sets first and then the association rules that hold within the items in the frequent sets. In temporal datasets, as the time at which a transaction takes place is important, we may find sets of items that are frequent in certain time intervals but not frequent throughout the dataset. These frequent sets may give rise to interesting rules, but these cannot be discovered if we calculate the supports of the item sets in the usual way. We call these frequent sets locally frequent. Normally these locally frequent sets are periodic in nature. We propose modifications to the Apriori algorithm to compute locally frequent sets, periodic frequent sets and periodic association rules.

Journal ArticleDOI
TL;DR: Theoretical analysis proves the superiority of FPMFI, and experimental comparison reveals that its performance is more than twice that of a similar algorithm based on the FP-Tree.
Abstract: During the process of mining maximal frequent item sets, when the minimum support is small, superset checking is a time-consuming and frequent operation in the mining algorithm. In this paper, a new algorithm FPMFI (frequent pattern tree for maximal frequent item sets) for mining maximal frequent item sets is proposed. It adopts a new superset checking method based on the projection of the maximal frequent item sets, which efficiently reduces the cost of superset checking. In addition, FPMFI compresses the conditional FP-Tree (frequent pattern tree) greatly by deleting redundant information, which reduces the cost of accessing the tree. Theoretical analysis proves that FPMFI is superior, and experimental comparison reveals that its performance is more than twice that of a similar algorithm based on the FP-Tree.

Journal ArticleDOI
TL;DR: ExAMiner is a breadth-first algorithm that exploits the real synergy of antimonotone and monotone constraints: the total benefit is greater than the sum of the two individual benefits.
Abstract: The key point of this article is that, in frequent pattern mining, the most appropriate way of exploiting monotone constraints in conjunction with frequency is to use them to reduce the input data; this reduction in turn induces a stronger pruning of the search space of the problem. Following this intuition, we introduce ExAMiner, a breadth-first algorithm that exploits the real synergy of antimonotone and monotone constraints: the total benefit is greater than the sum of the two individual benefits. ExAMiner generalizes the basic idea of the preprocessing algorithm ExAnte (Bonchi et al., 2003b), embedding such ideas at all levels of an Apriori-like computation. The resulting algorithm is the generalization of the Apriori algorithm when a conjunction of monotone constraints is conjoined with the frequency antimonotone constraint. Experimental results confirm that this is, so far, the most efficient way of attacking the computational problem under analysis.
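
The data-reduction idea can be sketched with a concrete monotone constraint such as "total price of the items is at least min_total" (a hedged illustration of the ExAnte-style loop that ExAMiner generalizes; the names and the example constraint are our choices):

```python
from collections import Counter

def exante_reduce(db, prices, min_total, min_count):
    """Alternate monotone and antimonotone pruning until a fixpoint: each step
    shrinks the data, which strengthens the next step (the claimed synergy)."""
    db = [set(t) for t in db]
    changed = True
    while changed:
        changed = False
        # monotone reduction: a transaction whose full itemset violates the
        # constraint cannot contain any itemset satisfying it, so drop it
        kept = [t for t in db if sum(prices[i] for i in t) >= min_total]
        # antimonotone reduction: drop now-infrequent items from transactions
        freq = Counter(i for t in kept for i in t)
        new_db = [{i for i in t if freq[i] >= min_count} for t in kept]
        new_db = [t for t in new_db if t]
        if new_db != db:
            db, changed = new_db, True
    return db
```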

Proceedings ArticleDOI
27 Nov 2005
TL;DR: A thorough experimental study of datasets with respect to frequent item sets, exhibiting a new characterization of datasets and some invariants that allow better prediction of the behavior of well-known algorithms.
Abstract: The discovery of frequent patterns is a famous problem in data mining. While plenty of algorithms have been proposed during the last decade, only a few contributions have tried to understand the influence of datasets on the algorithms' behavior. Being able to explain why certain algorithms are likely to perform very well or very poorly on some datasets is still an open question. In this setting, we describe a thorough experimental study of datasets with respect to frequent item sets. We study the distribution of frequent item sets with respect to item set size, together with the distribution of three concise representations: frequent closed, frequent free and frequent essential item sets. For each of them, we also study the distribution of their positive and negative borders whenever possible. From this analysis, we exhibit a new characterization of datasets and some invariants that allow us to better predict the behavior of well-known algorithms. The main perspective of this work is to devise algorithms that adapt to dataset characteristics.

Book ChapterDOI
22 Jul 2005
TL;DR: The results of the experiments prove that the FTSC and FTSHC algorithms are more efficient than the K-Means algorithm in clustering performance, and provide an understandable description of the discovered clusters by the frequent term sets.
Abstract: In this paper, a new text-clustering algorithm named Frequent Term Set-based Clustering (FTSC) is introduced. It uses frequent term sets to cluster texts. First, it extracts useful information from documents and inserts it into databases. Then, it uses the Apriori algorithm, based on association rules mining, to efficiently discover the frequent item sets. Finally, it clusters the documents according to the frequent words in subsets of the frequent term sets. This algorithm can reduce the dimension of the text data efficiently for very large databases, thus improving the accuracy and speed of the clustering algorithm. The results of clustering texts by the FTSC algorithm cannot reflect the overlap of text classes. Based on the FTSC algorithm, an improved algorithm, the Frequent Term Set-based Hierarchical Clustering (FTSHC) algorithm, is given. This algorithm can determine the overlap of text classes by the overlap of the frequent word sets, and provides an understandable description of the discovered clusters by the frequent term sets. The FTSC, FTSHC and K-Means algorithms are evaluated quantitatively by experiments. The results of the experiments prove that the FTSC and FTSHC algorithms are more efficient than the K-Means algorithm in clustering performance.

01 Jan 2005
TL;DR: It is demonstrated that association rules that show the correlation relationships between user navigation patterns and web pages they find interesting can be transformed into collaborative filtering data, and the weighted averages scheme more accurately computes predictions of user interests than the simple averages scheme does.
Abstract: This dissertation is a simulation study of factors and techniques involved in designing hyperlink recommender systems that recommend to users, web pages that past users with similar navigation behaviors found interesting. The methodology involves identification of pertinent factors or techniques, and for each one, addresses the following questions: (a) room for improvement; (b) better approach, if any; and (c) performance characteristics of the technique in environments that hyperlink recommender systems operate in. The following four problems are addressed: Web page classification. A new metric (PageRank × Inverse Links-to-Word count ratio) is proposed for classifying web pages as content or navigation, to help in the discovery of user navigation behaviors from web user access logs. Results of a small user study suggest that this metric leads to desirable results. Data mining. A new apriori algorithm for mining association rules from large databases is proposed. The new algorithm addresses the problem of scaling of the classical apriori algorithm by eliminating an expensive join step, and applying the apriori property to every row of the database. In this study, association rules show the correlation relationships between user navigation behaviors and web pages they find interesting. The new algorithm has better space complexity than the classical one, and better time efficiency under some conditions and comparable time efficiency under other conditions. Prediction models for user interests. We demonstrate that association rules that show the correlation relationships between user navigation patterns and web pages they find interesting can be transformed into collaborative filtering data. We investigate collaborative filtering prediction models based on two approaches for computing prediction scores: using simple averages and weighted averages. Our findings suggest that the weighted averages scheme more accurately computes predictions of user interests than the simple averages scheme does. Clustering. Clustering techniques are frequently applied in the design of personalization systems. We studied the performance of the CLARANS clustering algorithm in high dimensional space in relation to the PAM and CLARA clustering algorithms. While CLARA had the best time performance, CLARANS resulted in clusters with the lowest intra-cluster dissimilarities, and so was most effective in this regard.

Proceedings ArticleDOI
15 Dec 2005
TL;DR: This paper describes an interactive graphical user interface tool that can be used to study two famous frequent itemset generation algorithms, namely, Apriori and Eclat, and provides a hands-on environment for doing so.
Abstract: This paper describes an interactive graphical user interface tool called Visual Apriori that can be used to study two famous frequent itemset generation algorithms, namely, Apriori and Eclat. Understanding the functional behavior of these two algorithms is critical for students taking a data mining course; and Visual Apriori provides a hands-on environment for doing so. Visual Apriori relies on active participation from the user, where one inputs a transactional database and the tool produces a tree-based frequent itemset generation animation for the algorithm chosen. Visual Apriori provides an effortless learning experience by featuring user-friendly and easy to understand controls.

Book ChapterDOI
09 Jul 2005
TL;DR: It is shown that AprioriSetsAndSequences produces rules that express significant temporal relationships that describe patterns of behavior observed in the data set, and inherently handles different levels of time granularity in the same data set.
Abstract: We introduce an algorithm for mining expressive temporal relationships from complex data. Our algorithm, AprioriSetsAndSequences (ASAS), extends the Apriori algorithm to data sets in which a single data instance may consist of a combination of attribute values that are nominal sequences, time series, sets, and traditional relational values. Data sets of this type occur naturally in many domains including health care, financial analysis, complex system diagnostics, and domains in which multi-sensors are used. AprioriSetsAndSequences identifies predefined events of interest in the sequential data attributes. It then mines for association rules that make explicit all frequent temporal relationships among the occurrences of those events and relationships of those events and other data attributes. Our algorithm inherently handles different levels of time granularity in the same data set. We have implemented AprioriSetsAndSequences within the Weka environment [1] and have applied it to computer performance, stock market, and clinical sleep disorder data. We show that AprioriSetsAndSequences produces rules that express significant temporal relationships that describe patterns of behavior observed in the data set.

Book ChapterDOI
03 Oct 2005
TL;DR: It is claimed that when a proper item order is used, complete pruning does not necessarily speed up Apriori, and in databases with certain characteristics pruning increases run time significantly; if complete pruning is applied, then an intersection-based technique not only results in a faster algorithm, but yields closed-itemset selection for free in terms of both memory consumption and run-time.
Abstract: In this paper we investigate the relationship between closed itemset mining, the complete pruning technique and item ordering in the Apriori algorithm. We claim that when a proper item order is used, complete pruning does not necessarily speed up Apriori, and in databases with certain characteristics, pruning increases run time significantly. We also show that if complete pruning is applied, then an intersection-based technique not only results in a faster algorithm, but closed-itemset selection comes for free in terms of both memory consumption and run-time.
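
The pruning step under discussion is the standard full-subset test; written out (sketch, names ours), it is also easy to see where the cost goes: k + 1 subset lookups per (k+1)-candidate, paid whether or not the candidate is ever removed.

```python
from itertools import combinations

def complete_prune(candidate, frequent_prev):
    """Complete pruning: keep a (k+1)-candidate only if all of its k-subsets
    are in the frequent k-itemsets of the previous level."""
    k = len(candidate) - 1
    return all(frozenset(s) in frequent_prev for s in combinations(candidate, k))
```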

Proceedings Article
13 Feb 2005
TL;DR: Two novel methods, the Bitmap-based GSP (BGSP) and the SM-Tree (State Machine-Tree) algorithms, are presented as enhancements of the GSP-based sequential pattern mining approach.
Abstract: Sequential pattern mining is a heavily researched area in the field of data mining with a wide variety of applications. The task of discovering frequent sequences is challenging, because the algorithm needs to process a combinatorially explosive number of possible sequences. Most of the methods dealing with the sequential pattern mining problem are based on the approach of the traditional task of itemset mining, because the former can be interpreted as a generalization of the latter. Several algorithms use a level-wise "candidate generate and test" approach, while others use projected databases to discover the frequent sequences. In this paper a classification of the well-known sequence mining algorithms is presented. Because each algorithm has its own advantages and drawbacks regarding execution time and memory requirements, and the exact aims of the algorithms differ as well, an exact ranking of the methods is omitted. A basic level-wise algorithm, GSP, is described in detail. Because level-wise algorithms in general need less memory than projection-based ones, an efficient implementation of the GSP algorithm is also suggested. Two novel methods, the Bitmap-based GSP (BGSP) and the SM-Tree (State Machine-Tree) algorithms, are presented as enhancements of the GSP-based sequential pattern mining approach.
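
The "candidate generate and test" core of GSP rests on an ordered containment check between a candidate sequence and a data sequence; a minimal sketch of that test and of support counting (ours, using greedy matching, with each sequence element an itemset):

```python
def is_subsequence(pattern, sequence):
    """True if the pattern's elements occur in order, each contained in some
    later element of the data sequence (greedy left-to-right matching)."""
    pos = 0
    for element in sequence:
        if pos < len(pattern) and pattern[pos] <= element:
            pos += 1
    return pos == len(pattern)

def sequence_support(pattern, db):
    """Number of data sequences containing the candidate pattern."""
    return sum(is_subsequence(pattern, seq) for seq in db)

db = [[{"a"}, {"b", "c"}, {"d"}], [{"a"}, {"c"}], [{"b"}, {"d"}]]
print(sequence_support([{"a"}, {"c"}], db))  # 2
```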

Proceedings ArticleDOI
07 Nov 2005
TL;DR: This paper presents a novel algorithm for mining frequent itemsets in sparse databases; compared with existing algorithms it has a visible advantage, and the experimental results show that it is very promising.
Abstract: This paper presents a novel algorithm for mining frequent itemsets in sparse databases; compared with existing algorithms it has a visible advantage. The algorithm requires fewer scans of the transaction database: only one for small and medium transaction databases, and not more than two for large ones. During the scan, the transaction items are stored as triplets, and the count of every transaction item is kept in a one-dimensional array, so that the frequent itemsets can be generated in memory. I/O cost is thus greatly reduced. The experimental results show that our algorithm is very promising.

Book ChapterDOI
Nevcihan Duru
14 Sep 2005
TL;DR: A software tool (DMAP) that uses the Apriori algorithm, usually employed for market basket analysis, was used to analyze a diabetic database.
Abstract: In recent days, mining information from large databases has been recognized by many researchers, and many data mining techniques and systems have been developed. In this study, a software tool (DMAP), which uses the Apriori algorithm, was developed. Apriori is an influential algorithm used in data mining. The name of the algorithm is based on the fact that the algorithm uses prior knowledge of frequent item set properties. The software is used for discovering the social status of diabetics. A diabetic database that belongs to the Faculty of Medicine of Kocaeli University has been used. The software was executed on a database which has records of 66 patients for test purposes. In the literature, diabetic databases have often been analyzed by rough sets. In this paper, the Apriori algorithm, which has usually been used for market basket analysis, was used for analyzing a diabetic database.