Journal Article

Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach

TL;DR: A novel frequent-pattern tree (FP-tree) structure is proposed, which is an extended prefix-tree structure for storing compressed, crucial information about frequent patterns, and an efficient FP-tree-based mining method, FP-growth, is developed for mining the complete set of frequent patterns by pattern fragment growth.
Abstract: Mining frequent patterns in transaction databases, time-series databases, and many other kinds of databases has been studied popularly in data mining research. Most of the previous studies adopt an Apriori-like candidate set generation-and-test approach. However, candidate set generation is still costly, especially when there exist a large number of patterns and/or long patterns. In this study, we propose a novel frequent-pattern tree (FP-tree) structure, which is an extended prefix-tree structure for storing compressed, crucial information about frequent patterns, and develop an efficient FP-tree-based mining method, FP-growth, for mining the complete set of frequent patterns by pattern fragment growth. Efficiency of mining is achieved with three techniques: (1) a large database is compressed into a condensed, smaller data structure, the FP-tree, which avoids costly, repeated database scans, (2) our FP-tree-based mining adopts a pattern-fragment growth method to avoid the costly generation of a large number of candidate sets, and (3) a partitioning-based, divide-and-conquer method is used to decompose the mining task into a set of smaller tasks for mining confined patterns in conditional databases, which dramatically reduces the search space. Our performance study shows that the FP-growth method is efficient and scalable for mining both long and short frequent patterns, and is about an order of magnitude faster than the Apriori algorithm and also faster than some recently reported new frequent-pattern mining methods.
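
Illustration (not part of the original listing): the compression step in technique (1) can be sketched as a two-scan prefix-tree build. This is a minimal sketch under assumptions: the names FPNode, build_fp_tree and count_nodes are hypothetical, the toy transactions are invented, and ties in support are broken alphabetically, which the abstract does not specify.

```python
from collections import Counter


class FPNode:
    """One prefix-tree node: an item, a count, and links to parent/children."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_support):
    """Build a prefix tree over the frequent items of each transaction.

    Returns (root, header), where header maps each frequent item to the
    list of tree nodes that carry it.
    """
    # First scan: count how many transactions contain each item.
    support = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in support.items() if c >= min_support}

    root = FPNode(None)
    header = {}
    # Second scan: insert each transaction's frequent items, most frequent
    # first, so that transactions sharing a prefix share tree nodes.
    for t in transactions:
        items = sorted(
            (i for i in set(t) if i in frequent),
            key=lambda i: (-frequent[i], i),  # assumption: alphabetical ties
        )
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            child.count += 1
            node = child
    return root, header


def count_nodes(node):
    """Number of item nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(c) for c in node.children.values())


# Invented toy data: four transactions, minimum support of 2.
toy = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "d"]]
root, header = build_fp_tree(toy, min_support=2)
print(count_nodes(root), sorted(header))
```

In this toy run the infrequent item d is dropped and the three transactions starting with a share a single a node, which is where the compression comes from.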


Citations
Journal Article
TL;DR: This paper proposes three novel tree structures to efficiently perform incremental and interactive HUP mining that can capture the incremental data without any restructuring operation, and shows that these tree structures are very efficient and scalable.
Abstract: Recently, high utility pattern (HUP) mining is one of the most important research issues in data mining due to its ability to consider the nonbinary frequency values of items in transactions and different profit values for every item. On the other hand, incremental and interactive data mining provide the ability to use previous data structures and mining results in order to reduce unnecessary calculations when a database is updated, or when the minimum threshold is changed. In this paper, we propose three novel tree structures to efficiently perform incremental and interactive HUP mining. The first tree structure, Incremental HUP Lexicographic Tree (IHUPL-Tree), is arranged according to an item's lexicographic order. It can capture the incremental data without any restructuring operation. The second tree structure is the IHUP transaction frequency tree (IHUPTF-Tree), which obtains a compact size by arranging items according to their transaction frequency (descending order). To reduce the mining time, the third tree, IHUP-transaction-weighted utilization tree (IHUPTWU-Tree) is designed based on the TWU value of items in descending order. Extensive performance analyses show that our tree structures are very efficient and scalable for incremental and interactive HUP mining.
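
Illustration (not part of the original listing): the TWU ordering mentioned above can be made concrete with a small computation. This sketch assumes the commonly used definitions, which the abstract does not spell out: a transaction's utility is the sum of quantity times unit profit over its items, and an item's TWU is the sum of the utilities of the transactions that contain it. The function names and the data are hypothetical.

```python
def transaction_utility(transaction, profit):
    """Utility of one transaction given as {item: quantity}."""
    return sum(qty * profit[item] for item, qty in transaction.items())


def twu(transactions, profit):
    """Transaction-weighted utilization of every item.

    Each item accumulates the full utility of every transaction it occurs in.
    """
    totals = {}
    for t in transactions:
        tu = transaction_utility(t, profit)
        for item in t:
            totals[item] = totals.get(item, 0) + tu
    return totals


# Invented data: unit profits and two transactions with item quantities.
profit = {"a": 2, "b": 1, "c": 5}
transactions = [{"a": 1, "b": 2}, {"b": 1, "c": 1}]

scores = twu(transactions, profit)
# Items in descending TWU order, the ordering the third tree is based on.
order = sorted(scores, key=lambda i: (-scores[i], i))
print(scores, order)
```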

555 citations


Cites background from "Mining Frequent Patterns without Ca..."

  • ...In the following sections, we will discuss research on frequent pattern mining, weighted frequent pattern mining, high utility pattern mining, and incremental and interactive pattern mining....

    [...]

  • ...…Farhan Ahmed, Syed Khairuzzaman Tanbeer, Byeong-Soo Jeong, and Young-Koo Lee, Member, IEEE Abstract—Recently, high utility pattern (HUP) mining is one of the most important research issues in data mining due to its ability to consider the nonbinary frequency values of items in transactions and…...

    [...]

Proceedings Article
29 Oct 2012
TL;DR: This paper proposes an algorithm, called HUI-Miner (High Utility Itemset Miner), which can efficiently mine high utility itemsets from the utility-lists constructed from a mined database and compares it with the state-of-the-art algorithms on various databases.
Abstract: High utility itemsets refer to the sets of items with high utility like profit in a database, and efficient mining of high utility itemsets plays a crucial role in many real-life applications and is an important research issue in data mining area. To identify high utility itemsets, most existing algorithms first generate candidate itemsets by overestimating their utilities, and subsequently compute the exact utilities of these candidates. These algorithms incur the problem that a very large number of candidates are generated, but most of the candidates are found out to be not high utility after their exact utilities are computed. In this paper, we propose an algorithm, called HUI-Miner (High Utility Itemset Miner), for high utility itemset mining. HUI-Miner uses a novel structure, called utility-list, to store both the utility information about an itemset and the heuristic information for pruning the search space of HUI-Miner. By avoiding the costly generation and utility computation of numerous candidate itemsets, HUI-Miner can efficiently mine high utility itemsets from the utility-lists constructed from a mined database. We compared HUI-Miner with the state-of-the-art algorithms on various databases, and experimental results show that HUI-Miner outperforms these algorithms in terms of both running time and memory consumption.

539 citations

Journal Article
TL;DR: A novel FP-array technique is presented that greatly reduces the need to traverse FP-trees, thus obtaining significantly improved performance for FP-tree-based algorithms and works especially well for sparse data sets.
Abstract: Efficient algorithms for mining frequent itemsets are crucial for mining association rules as well as for many other data mining tasks. Methods for mining frequent itemsets have been implemented using a prefix-tree structure, known as an FP-tree, for storing compressed information about frequent itemsets. Numerous experimental results have demonstrated that these algorithms perform extremely well. In this paper, we present a novel FP-array technique that greatly reduces the need to traverse FP-trees, thus obtaining significantly improved performance for FP-tree-based algorithms. Our technique works especially well for sparse data sets. Furthermore, we present new algorithms for mining all, maximal, and closed frequent itemsets. Our algorithms use the FP-tree data structure in combination with the FP-array technique efficiently and incorporate various optimization techniques. We also present experimental results comparing our methods with existing algorithms. The results show that our methods are the fastest for many cases. Even though the algorithms consume much memory when the data sets are sparse, they are still the fastest ones when the minimum support is low. Moreover, they are always among the fastest algorithms and consume less memory than other methods when the data sets are dense.

527 citations


Cites background or methods from "Mining Frequent Patterns without Ca..."

  • ...Furthermore, when the FP-tree is too large to fit in memory, the current solutions need a very large number of disk I/Os for reading and writing FP-trees onto secondary memory or generating many intermediate databases [15], which makes mining frequent itemsets too time-consuming....

    [...]

  • ...In the first set of experiments, we studied the performance of FPgrowth* by comparing it with the original FP-growth method [14], [15], kDCI [23], dEclat [35], Apriori [5], [6], and PatriciaMine [27]....

    [...]

  • ...The FP-growth method [14], [15] is a depth-first algorithm....

    [...]

Journal Article
TL;DR: SPMF is an open-source data mining library offering implementations of more than 55 data mining algorithms, specialized for discovering patterns in transaction and sequence databases such as frequent itemsets, association rules and sequential patterns.
Abstract: We present SPMF, an open-source data mining library offering implementations of more than 55 data mining algorithms. SPMF is a cross-platform library implemented in Java, specialized for discovering patterns in transaction and sequence databases such as frequent itemsets, association rules and sequential patterns. The source code can be integrated in other Java programs. Moreover, SPMF offers a command line interface and a simple graphical interface for quick testing. The source code is available under the GNU General Public License, version 3. The website of the project offers several resources such as documentation with examples of how to run each algorithm, a developer's guide, performance comparisons of algorithms, data sets, an active forum, a FAQ and a mailing list.

417 citations


Cites background or methods from "Mining Frequent Patterns without Ca..."

  • ...Keywords: data mining, library, frequent pattern mining, sequence database, transaction database, open-source...

    [...]

  • ...For example, only three algorithms from SPMF appear in Weka and Knime (Apriori, FPGrowth and GSP), only one in Mahout (FPGrowth), two in LUCS-KDD (Apriori, FPGrowth), and eight in Coron....

    [...]

  • ...Weka, Knime and Mahout offer only a few popular pattern mining algorithms such as Apriori (Agrawal and Srikant, 1994), GSP (Srikant et al., 1996) and FPGrowth (Han et al., 2004)....

    [...]

  • ...For these three classical data mining tasks with wide applications, SPMF offers implementations of popular algorithms such as Apriori, Eclat (Zaki, M. J.), FPGrowth, GSP, PrefixSpan (Pei et al., 2004), SPAM (Ayres et al., 2000) and BIDE (Wang et al., 2007)....

    [...]

Proceedings Article
13 Jun 2004
TL;DR: This paper presents an efficient, scalable and general algorithm for performing set joins on predicates involving various similarity measures like intersect size, Jaccard-coefficient, cosine similarity, and edit-distance that generalize to several weighted and unweighted measures of partial word overlap between sets.
Abstract: In this paper we present an efficient, scalable and general algorithm for performing set joins on predicates involving various similarity measures like intersect size, Jaccard-coefficient, cosine similarity, and edit-distance. This expands the existing suite of algorithms for set joins on simpler predicates such as, set containment, equality and non-zero overlap. We start with a basic inverted index based probing method and add a sequence of optimizations that result in one to two orders of magnitude improvement in running time. The algorithm folds in a data partitioning strategy that can work efficiently with an index compressed to fit in any available amount of main memory. The optimizations used in our algorithm generalize to several weighted and unweighted measures of partial word overlap between sets.
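
Illustration (not part of the original listing): the "basic inverted index based probing method" that the abstract starts from can be sketched for the Jaccard case as below. This covers only that baseline idea, with none of the paper's optimizations or its partitioning strategy; the function name jaccard_self_join and the data are hypothetical.

```python
from collections import defaultdict


def jaccard_self_join(sets, threshold):
    """Return (i, j, similarity) for all i < j with Jaccard >= threshold.

    `sets` is a list of Python sets of tokens. Each record probes an
    inverted index (token -> ids of earlier records) to count overlaps,
    and is then added to the index itself.
    """
    index = defaultdict(list)
    results = []
    for rid, tokens in enumerate(sets):
        # Probe: count shared tokens with every earlier record.
        overlap = defaultdict(int)
        for token in tokens:
            for other in index[token]:
                overlap[other] += 1
        # Verify: turn each overlap count into a Jaccard coefficient.
        for other, inter in overlap.items():
            union = len(tokens) + len(sets[other]) - inter
            similarity = inter / union
            if similarity >= threshold:
                results.append((other, rid, similarity))
        # Index the current record for later probes.
        for token in tokens:
            index[token].append(rid)
    return results


# Invented data: only the first two records share enough tokens.
records = [{"a", "b", "c"}, {"a", "b", "d"}, {"x", "y"}]
print(jaccard_self_join(records, threshold=0.5))
```

Pairs that share no token never meet in the index, so they are never compared; that is the saving over checking all pairs.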

376 citations

References
Proceedings Article
01 Jun 1993
TL;DR: An efficient algorithm is presented that generates all significant association rules between items in the database of customer transactions and incorporates buffer management and novel estimation and pruning techniques.
Abstract: We are given a large database of customer transactions. Each transaction consists of items purchased by a customer in a visit. We present an efficient algorithm that generates all significant association rules between items in the database. The algorithm incorporates buffer management and novel estimation and pruning techniques. We also present results of applying this algorithm to sales data obtained from a large retailing company, which shows the effectiveness of the algorithm.

15,645 citations


"Mining Frequent Patterns without Ca..." refers background in this paper

  • ...Frequent-pattern mining plays an essential role in mining associations (Agrawal et al., 1993, 1996; Agrawal and Srikant, 1994; Mannila et al., 1994), correlations (Brin et al., 1997), causality (Silverstein et al., 1998), sequential patterns (Agrawal and Srikant, 1995), episodes (Mannila et al.,…...

    [...]

Proceedings Article
01 Jul 1998
TL;DR: Two new algorithms for solving this problem that are fundamentally different from the known algorithms are presented and empirical evaluation shows that these algorithms outperform the known algorithms by factors ranging from three for small problems to more than an order of magnitude for large problems.
Abstract: We consider the problem of discovering association rules between items in a large database of sales transactions. We present two new algorithms for solving this problem that are fundamentally different from the known algorithms. Empirical evaluation shows that these algorithms outperform the known algorithms by factors ranging from three for small problems to more than an order of magnitude for large problems. We also show how the best features of the two proposed algorithms can be combined into a hybrid algorithm, called AprioriHybrid. Scale-up experiments show that AprioriHybrid scales linearly with the number of transactions. AprioriHybrid also has excellent scale-up properties with respect to the transaction size and the number of items in the database.

10,863 citations

Proceedings Article
12 Sep 1994

10,454 citations


"Mining Frequent Patterns without Ca..." refers background or methods in this paper

  • ...Frequent-pattern mining plays an essential role in mining associations (Agrawal et al., 1993, 1996; Agrawal and Srikant, 1994; Mannila et al., 1994), correlations (Brin et al., 1997), causality (Silverstein et al., 1998), sequential patterns (Agrawal and Srikant, 1995), episodes (Mannila et al.,…...

    [...]

  • ...Most of the previous studies, such as Agrawal and Srikant (1994), Mannila et al. (1994), Agrawal et al. (1996), Savasere et al. (1995), Park et al. (1995), Lent et al. (1997), Sarawagi et al. (1998), Srikant et al. (1997), Ng et al. (1998) and Grahne et al. (2000), adopt an Apriori-like approach,…...

    [...]

  • ...A performance study has been conducted to compare the performance of FP-growth with two representative frequent-pattern mining methods, Apriori (Agrawal and Srikant, 1994) and TreeProjection (Agarwal et al., 2001)....

    [...]

  • ...The synthetic data sets which we used for our experiments were generated using the procedure described in Agrawal and Srikant (1994)....

    [...]

  • ...…Srikant et al. (1997), Ng et al. (1998) and Grahne et al. (2000), adopt an Apriori-like approach, which is based on the anti-monotone Apriori heuristic (Agrawal and Srikant, 1994): if any length k pattern is not frequent in the database, its length (k + 1) super-pattern can never be frequent....

    [...]
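
Illustration (not part of the original listing): the anti-monotone heuristic quoted in the last excerpt above is what lets an Apriori-like method discard candidates before counting them. The sketch below is a simplified candidate-generation step under assumptions: the function name generate_candidates and the data are hypothetical, and the join is a plain pairwise union rather than the ordered prefix join of the published algorithm.

```python
from itertools import combinations


def generate_candidates(frequent_k):
    """Build (k+1)-item candidates from a set of frequent k-itemsets.

    `frequent_k` is a set of frozensets, all of size k. A candidate is
    kept only if every one of its k-item subsets is frequent, because
    an itemset with an infrequent subset can never be frequent.
    """
    candidates = set()
    for a in frequent_k:
        for b in frequent_k:
            union = a | b
            if len(union) != len(a) + 1:
                continue  # a and b must differ in exactly one item
            subsets = combinations(union, len(a))
            if all(frozenset(s) in frequent_k for s in subsets):
                candidates.add(union)
    return candidates


# Invented data: four frequent 2-itemsets.
frequent_2 = {
    frozenset("ab"),
    frozenset("ac"),
    frozenset("bc"),
    frozenset("bd"),
}
print(generate_candidates(frequent_2))
```

Here only {a, b, c} survives: {a, b, d} is pruned because {a, d} is not frequent, and {b, c, d} because {c, d} is not. The surviving candidates still have to be counted against the database, which is the cost FP-growth is designed to avoid.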

Journal Article
16 May 2000
TL;DR: This study proposes a novel frequent pattern tree (FP-tree) structure, which is an extended prefix-tree structure for storing compressed, crucial information about frequent patterns, and develops an efficient FP-tree-based mining method, FP-growth, for mining the complete set of frequent patterns by pattern fragment growth.
Abstract: Mining frequent patterns in transaction databases, time-series databases, and many other kinds of databases has been studied popularly in data mining research. Most of the previous studies adopt an Apriori-like candidate set generation-and-test approach. However, candidate set generation is still costly, especially when there exist prolific patterns and/or long patterns. In this study, we propose a novel frequent pattern tree (FP-tree) structure, which is an extended prefix-tree structure for storing compressed, crucial information about frequent patterns, and develop an efficient FP-tree-based mining method, FP-growth, for mining the complete set of frequent patterns by pattern fragment growth. Efficiency of mining is achieved with three techniques: (1) a large database is compressed into a highly condensed, much smaller data structure, which avoids costly, repeated database scans, (2) our FP-tree-based mining adopts a pattern fragment growth method to avoid the costly generation of a large number of candidate sets, and (3) a partitioning-based, divide-and-conquer method is used to decompose the mining task into a set of smaller tasks for mining confined patterns in conditional databases, which dramatically reduces the search space. Our performance study shows that the FP-growth method is efficient and scalable for mining both long and short frequent patterns, and is about an order of magnitude faster than the Apriori algorithm and also faster than some recently reported new frequent pattern mining methods.
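
Illustration (not part of the original listing): techniques (2) and (3) can be sketched as a recursive miner that grows a pattern one item at a time and recurses into the conditional database of that item. This is a simplification under stated assumptions: it works on plain lists of (items, count) pairs instead of a compressed FP-tree, so it shows the divide-and-conquer structure only; the function name pattern_growth and the data are hypothetical.

```python
from collections import Counter


def pattern_growth(db, min_support, suffix=frozenset()):
    """Mine all frequent itemsets by pattern growth.

    `db` is a list of (items, count) pairs, each `items` a tuple of
    distinct items. Returns {frozenset(pattern): support}.
    """
    support = Counter()
    for items, count in db:
        for item in items:
            support[item] += count

    # Rank frequent items by descending support (alphabetical ties).
    frequent = sorted(
        (i for i in support if support[i] >= min_support),
        key=lambda i: (-support[i], i),
    )
    rank = {item: r for r, item in enumerate(frequent)}

    results = {}
    for item in reversed(frequent):  # least frequent item first
        pattern = suffix | {item}
        results[pattern] = support[item]
        # Conditional database of `item`: for every transaction holding
        # it, keep only the frequent items ranked before it.
        conditional = []
        for items, count in db:
            if item in items:
                prefix = tuple(
                    i for i in items if i in rank and rank[i] < rank[item]
                )
                if prefix:
                    conditional.append((prefix, count))
        # Grow the pattern inside the smaller conditional database.
        results.update(pattern_growth(conditional, min_support, pattern))
    return results


# Invented toy data: four transactions, each occurring once.
toy = [("a", "b", "c"), ("a", "b"), ("a", "c"), ("b", "d")]
found = pattern_growth([(t, 1) for t in toy], min_support=2)
for pattern, count in sorted(found.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(pattern), count)
```

No candidate itemset is generated and then tested here: every pattern reported is already known to be frequent within the conditional database it came from.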

6,118 citations

Proceedings Article
06 Mar 1995
TL;DR: Three algorithms are presented to solve the problem of mining sequential patterns over databases of customer transactions, and empirically evaluating their performance using synthetic data shows that two of them have comparable performance.
Abstract: We are given a large database of customer transactions, where each transaction consists of customer-id, transaction time, and the items bought in the transaction. We introduce the problem of mining sequential patterns over such databases. We present three algorithms to solve this problem, and empirically evaluate their performance using synthetic data. Two of the proposed algorithms, AprioriSome and AprioriAll, have comparable performance, albeit AprioriSome performs a little better when the minimum number of customers that must support a sequential pattern is low. Scale-up experiments show that both AprioriSome and AprioriAll scale linearly with the number of customer transactions. They also have excellent scale-up properties with respect to the number of transactions per customer and the number of items in a transaction.

5,663 citations


"Mining Frequent Patterns without Ca..." refers background or methods in this paper

  • ...Most of the previously developed sequential pattern mining methods, such as Agrawal and Srikant (1995), Srikant and Agrawal (1996) and Mannila et al. (1997), follow the methodology of Apriori since the Apriori-based method may substantially reduce the number of combinations to be examined....

    [...]

  • ...…1996; Agrawal and Srikant, 1994; Mannila et al., 1994), correlations (Brin et al., 1997), causality (Silverstein et al., 1998), sequential patterns (Agrawal and Srikant, 1995), episodes (Mannila et al., 1997), multi-dimensional patterns (Lent et al., 1997; Kamber et al., 1997), max-patterns…...

    [...]