Journal ArticleDOI

A survey of itemset mining

TL;DR: An up‐to‐date survey of itemset mining problems is presented, and the relationship to other popular pattern mining problems, such as sequential pattern mining, episode mining, subgraph mining, and association rule mining, is discussed.
Abstract: Itemset mining is an important subfield of data mining, which consists of discovering interesting and useful patterns in transaction databases. The traditional task of frequent itemset mining is to discover groups of items (itemsets) that appear frequently together in transactions made by customers. Although itemset mining was designed for market basket analysis, it can be viewed more generally as the task of discovering groups of attribute values frequently co-occurring in databases. Because of its numerous applications in domains such as bioinformatics, text mining, product recommendation, e-learning, and web click stream analysis, itemset mining has become a popular research area. This study provides an up-to-date survey that can serve both as an introduction and as a guide to recent advances and opportunities in the field. The problem of frequent itemset mining and its applications are described. Moreover, the main approaches and strategies for solving itemset mining problems are presented, together with their characteristics. Limitations of traditional frequent itemset mining approaches are also highlighted, and extensions of the task of itemset mining are presented, such as high-utility itemset mining, rare itemset mining, fuzzy itemset mining, and uncertain itemset mining. This study also discusses research opportunities and the relationship to other popular pattern mining problems, such as sequential pattern mining, episode mining, subgraph mining, and association rule mining. The main open-source libraries of itemset mining implementations are also briefly presented. WIREs Data Mining Knowl Discov 2017, 7:e1207. doi: 10.1002/widm.1207
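To make the task in the abstract concrete, the following is a minimal sketch of frequent itemset mining by exhaustive enumeration. The function names, the toy transactions, and the absolute threshold `minsup` are illustrative assumptions, not taken from the survey; the approach is only practical for a handful of items, since the number of candidate itemsets grows exponentially.

```python
from itertools import combinations


def support(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= set(t))


def frequent_itemsets_bruteforce(transactions, minsup):
    """Enumerate all non-empty itemsets and keep those with support >= minsup."""
    all_items = sorted({item for t in transactions for item in t})
    result = {}
    for k in range(1, len(all_items) + 1):
        for candidate in combinations(all_items, k):
            s = support(candidate, transactions)
            if s >= minsup:
                result[candidate] = s
    return result


# Hypothetical market-basket data (one set of items per customer transaction).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]
frequent = frequent_itemsets_bruteforce(transactions, minsup=2)
for itemset, count in sorted(frequent.items()):
    print(itemset, count)
```

Each single item and each pair appears in at least two transactions here, while the three-item set appears in only one and is therefore not reported.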
Citations
Journal ArticleDOI
TL;DR: An in-depth survey of the current status of parallel SPM (PSPM) is provided, including a detailed categorization of traditional serial SPM approaches and state-of-the-art PSPM.
Abstract: With the growing popularity of shared resources, large volumes of complex data of different types are collected automatically. Traditional data mining algorithms generally face challenges including huge memory cost, low processing speed, and inadequate hard disk space. As a fundamental task of data mining, sequential pattern mining (SPM) is used in a wide variety of real-life applications. However, it is more complex and challenging than other pattern mining tasks, i.e., frequent itemset mining and association rule mining, and also suffers from the above challenges when handling large-scale data. To solve these problems, mining sequential patterns in a parallel or distributed computing environment has emerged as an important issue with many applications. In this article, an in-depth survey of the current status of parallel SPM (PSPM) is provided, including a detailed categorization of traditional serial SPM approaches and state-of-the-art PSPM. We review the related work on PSPM in detail, including partition-based algorithms for PSPM, apriori-based PSPM, pattern-growth-based PSPM, and hybrid algorithms for PSPM, and provide an in-depth description (i.e., characteristics, advantages, disadvantages, and summarization) of these parallel approaches. Some advanced topics for PSPM, including parallel quantitative/weighted/utility SPM, PSPM from uncertain data and stream data, and hardware acceleration for PSPM, are further reviewed in detail. In addition, we review some well-known open-source software for PSPM. Finally, we summarize some challenges and opportunities of PSPM in the big data era.

188 citations


Cites background from "A survey of itemset mining"

  • ...KDD has numerous real-life applications and is crucial to some of the most fundamental tasks such as frequent itemset and association rule mining [3], [4], [6], sequential pattern mining [5], [7], [8], clustering [9], [10], classification [11], outline detection [12]....

    [...]

  • ...or association rule mining (ARM) has attracted a lot of attention [1], [3], [4], [6], [13], [14], [17]....

    [...]

  • ..., FIM, ARM and SPM) has been extensively studied and successfully applied in many fields [6], [8]....

    [...]

Journal ArticleDOI
TL;DR: This paper provides an up‐to‐date survey of the state‐of‐the‐art iHUIM algorithms, including Apriori‐based, tree‐based, and utility‐list‐based approaches, and identifies several important issues and research challenges for iHUIM.
Abstract: Traditional association rule mining has been widely studied. But it is unsuitable for real-world applications where factors such as unit profits of items and purchase quantities must be considered. High-utility itemset mining (HUIM) is designed to find highly profitable patterns by considering both the purchase quantities and unit profits of items. However, most HUIM algorithms are designed to be applied to static databases. But in real-world applications such as market basket analysis and business decision-making, databases are often dynamically updated by inserting new data such as customer transactions. Several researchers have proposed algorithms to discover high-utility itemsets (HUIs) in dynamically updated databases. Unlike batch algorithms, which always process a database from scratch, incremental high-utility itemset mining (iHUIM) algorithms incrementally update and output HUIs, thus reducing the cost of discovering HUIs. This paper provides an up-to-date survey of the state-of-the-art iHUIM algorithms, including Apriori-based, tree-based, and utility-list-based approaches. To the best of our knowledge, this is the first survey on the mining task of incremental high-utility itemset mining. The paper also identifies several important issues and research challenges for iHUIM. WIREs Data Mining Knowl Discov 2018, 8:e1242. doi: 10.1002/widm.1242
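As a rough illustration of how purchase quantities and unit profits combine into a utility value, the sketch below computes the utility of an itemset as the sum, over the transactions containing it, of quantity times unit profit for each of its items. This is the usual textbook definition; the data, the function name `itemset_utility`, and the profit table are hypothetical, and no incremental update logic is shown.

```python
def itemset_utility(itemset, transactions, unit_profit):
    """Sum quantity * unit profit over all transactions containing the whole itemset."""
    items = set(itemset)
    total = 0
    for quantities in transactions:
        # A transaction maps each purchased item to its purchase quantity.
        if items <= set(quantities):
            total += sum(quantities[i] * unit_profit[i] for i in items)
    return total


# Hypothetical unit profits and transactions (item -> purchased quantity).
unit_profit = {"a": 5, "b": 2, "c": 1}
transactions = [
    {"a": 1, "b": 2},
    {"a": 2, "c": 6},
    {"a": 1, "b": 1, "c": 3},
]

u_a = itemset_utility({"a"}, transactions, unit_profit)
u_ab = itemset_utility({"a", "b"}, transactions, unit_profit)
u_ac = itemset_utility({"a", "c"}, transactions, unit_profit)
print(u_a, u_ab, u_ac)
```

In this toy data the utility of {a, c} exceeds that of {a}, while {a, b} falls below it, which illustrates why utility, unlike support, cannot be used directly for downward-closure pruning.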

149 citations


Cites background from "A survey of itemset mining"

  • ..., 2012), but can also serve as inspiration for other data mining tasks (Fournier-Viger et al., 2017), including incremental data mining (Hong et al....

    [...]

  • ...…are not only important for iHUIM (Ahmed et al., 2009; Fournier-Viger et al., 2015; Lin et al., 2012), but can also serve as inspiration for other data mining tasks (Fournier-Viger et al., 2017), including incremental data mining (Hong et al., 2001) and dynamic data mining (Lin et al., 2009)....

    [...]

  • ...Two fundamental tasks for revealing interesting relationships between items in transactional databases are frequent itemset mining (FIM) and association rule mining (ARM) (Agrawal, Imielinski, & Swami, 1993; Chen, Han, & Yu, 1996; Fournier-Viger et al., 2017)....

    [...]

Journal ArticleDOI
TL;DR: An in-depth understanding of UPM is introduced, including concepts, examples, and comparisons with related concepts, and a comprehensive review of advanced topics of existing high-utility pattern mining techniques is offered, with a discussion of their pros and cons.
Abstract: The main purpose of data mining and analytics is to find novel, potentially useful patterns that can be utilized in real-world applications to derive beneficial knowledge. For identifying and evaluating the usefulness of different kinds of patterns, many techniques and constraints have been proposed, such as support, confidence, sequence order, and utility parameters (e.g., weight, price, profit, quantity, satisfaction, etc.). In recent years, there has been an increasing demand for utility-oriented pattern mining (UPM, or called utility mining). UPM is a vital task, with numerous high-impact applications, including cross-marketing, e-commerce, finance, medical, and biomedical applications. This survey aims to provide a general, comprehensive, and structured overview of the state-of-the-art methods of UPM. First, we introduce an in-depth understanding of UPM, including concepts, examples, and comparisons with related concepts. A taxonomy of the most common and state-of-the-art approaches for mining different kinds of high-utility patterns is presented in detail, including Apriori-based, tree-based, projection-based, vertical-/horizontal-data-format-based, and other hybrid approaches. A comprehensive review of advanced topics of existing high-utility pattern mining techniques is offered, with a discussion of their pros and cons. Finally, we present several well-known open-source software packages for UPM. We conclude our survey with a discussion on open and practical challenges in this field.

140 citations

Journal ArticleDOI
TL;DR: This work analyzes how FIM has been addressed over the last decades, considering centralized systems as well as parallel (shared or nonshared memory) architectures; solutions can also be divided into exhaustive search and nonexhaustive search models.
Abstract: Frequent itemset mining (FIM) is an essential task within data analysis since it is responsible for extracting frequently occurring events, patterns, or items in data. Insights from such pattern analysis offer important benefits in decision‐making processes. However, algorithmic solutions for mining such patterns are not straightforward since the computational complexity increases exponentially with the number of items in data. This issue, together with the significant memory consumption of the mining process, makes it necessary to propose extremely efficient solutions. Since the FIM problem was first described in the early 1990s, multiple solutions have been proposed by considering centralized systems as well as parallel (shared or nonshared memory) architectures. Solutions can also be divided into exhaustive search and nonexhaustive search models. Many such approaches are extensions of other solutions, and it is therefore necessary to analyze how this task has been considered during the last decades.

122 citations


Cites background from "A survey of itemset mining"

  • ...While some reviews have been already proposed in literature (Chee, Jaafar, Aziz, Hasan, & Yeoh, 2018; Fournier-Viger et al., 2017), they are mainly focused on sequential exhaustive search approaches and on describing the algorithms for nonexpert users....

    [...]

Journal ArticleDOI
TL;DR: A new cluster-based information retrieval approach named ICIR (Intelligent Cluster-based Information Retrieval) is proposed, which combines k-means clustering with frequent closed itemset mining to extract clusters of documents and find frequent terms in each cluster.

69 citations

References
Proceedings Article
01 Jul 1998
TL;DR: Two new algorithms for solving this problem that are fundamentally different from the known algorithms are presented and empirical evaluation shows that these algorithms outperform the known algorithms by factors ranging from three for small problems to more than an order of magnitude for large problems.
Abstract: We consider the problem of discovering association rules between items in a large database of sales transactions. We present two new algorithms for solving this problem that are fundamentally different from the known algorithms. Empirical evaluation shows that these algorithms outperform the known algorithms by factors ranging from three for small problems to more than an order of magnitude for large problems. We also show how the best features of the two proposed algorithms can be combined into a hybrid algorithm, called AprioriHybrid. Scale-up experiments show that AprioriHybrid scales linearly with the number of transactions. AprioriHybrid also has excellent scale-up properties with respect to the transaction size and the number of items in the database.
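The abstract does not describe the two algorithms in detail. As a hedged illustration of the general level-wise, candidate generation-and-test idea commonly associated with the Apriori family, the sketch below joins frequent (k-1)-itemsets into k-item candidates, prunes candidates that have an infrequent subset, and counts the survivors. It is a simplified teaching version under assumed names and toy data, not the authors' implementation and not AprioriHybrid.

```python
from itertools import combinations


def apriori(transactions, minsup):
    """Level-wise frequent itemset mining with subset-based candidate pruning."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= minsup}
    frequent = dict(level)

    k = 2
    while level:
        prev = set(level)
        # Join step: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in prev for sub in combinations(c, k - 1))
        }
        # Test step: one pass over the database to count the remaining candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {s: c for s, c in counts.items() if c >= minsup}
        frequent.update(level)
        k += 1
    return frequent


# Hypothetical sales transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]
frequent = apriori(transactions, minsup=2)
```

The pruning step relies on the fact that support can only decrease when items are added, so a candidate with any infrequent subset can be discarded without scanning the database.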

10,863 citations

01 Jan 2002

9,314 citations


"A survey of itemset mining" refers background in this paper

  • ...The goal of data mining is to predict the future or to understand the past.(1,2) Techniques used for pre-...

    [...]

  • ...Some common types of patterns found in databases are clusters, itemsets, trends, and outliers.(2) This...

    [...]

Journal ArticleDOI
TL;DR: A novel frequent-pattern tree (FP-tree) structure is proposed, which is an extended prefix-tree structure for storing compressed, crucial information about frequent patterns, and an efficient FP-tree-based mining method, FP-growth, is developed for mining the complete set of frequent patterns by pattern fragment growth.
Abstract: Mining frequent patterns in transaction databases, time-series databases, and many other kinds of databases has been studied popularly in data mining research. Most of the previous studies adopt an Apriori-like candidate set generation-and-test approach. However, candidate set generation is still costly, especially when there exist a large number of patterns and/or long patterns. In this study, we propose a novel frequent-pattern tree (FP-tree) structure, which is an extended prefix-tree structure for storing compressed, crucial information about frequent patterns, and develop an efficient FP-tree-based mining method, FP-growth, for mining the complete set of frequent patterns by pattern fragment growth. Efficiency of mining is achieved with three techniques: (1) a large database is compressed into a condensed, smaller data structure, FP-tree which avoids costly, repeated database scans, (2) our FP-tree-based mining adopts a pattern-fragment growth method to avoid the costly generation of a large number of candidate sets, and (3) a partitioning-based, divide-and-conquer method is used to decompose the mining task into a set of smaller tasks for mining confined patterns in conditional databases, which dramatically reduces the search space. Our performance study shows that the FP-growth method is efficient and scalable for mining both long and short frequent patterns, and is about an order of magnitude faster than the Apriori algorithm and also faster than some recently reported new frequent-pattern mining methods.
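The first technique named in the abstract, compressing the database into a prefix tree, can be sketched as follows. This is a minimal illustration of tree construction only: it omits the header table with node links and the pattern-fragment growth mining step, and the class name `FPNode`, the tie-breaking rule, and the toy data are assumptions made for the example.

```python
from collections import Counter


class FPNode:
    """One node of the prefix tree: an item, its count, and its children."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, minsup):
    """Build a prefix tree over the frequent items of each transaction."""
    # First scan: count how many transactions contain each item.
    item_counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in item_counts.items() if c >= minsup}

    root = FPNode(None)
    # Second scan: insert each transaction with items in a fixed global order
    # (descending frequency, ties broken by item name for determinism).
    for t in transactions:
        ordered = sorted(
            (i for i in set(t) if i in frequent),
            key=lambda i: (-frequent[i], i),
        )
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
            child.count += 1
            node = child
    return root, frequent


def count_nodes(node):
    """Number of nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(c) for c in node.children.values())


# Hypothetical transactions; item "d" is infrequent and is dropped.
transactions = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "d"]]
root, frequent = build_fp_tree(transactions, minsup=2)
print(count_nodes(root))
```

Transactions that share a prefix in the global order share tree nodes, which is where the compression comes from: the eight frequent item occurrences in this toy database are stored in five nodes.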

2,567 citations

Proceedings ArticleDOI
09 Dec 2002
TL;DR: A novel algorithm called gSpan (graph-based substructure pattern mining) is proposed, which discovers frequent substructures without candidate generation by building a new lexicographic order among graphs and mapping each graph to a unique minimum DFS code as its canonical label.
Abstract: We investigate new approaches for frequent graph-based pattern mining in graph datasets and propose a novel algorithm called gSpan (graph-based substructure pattern mining), which discovers frequent substructures without candidate generation. gSpan builds a new lexicographic order among graphs, and maps each graph to a unique minimum DFS code as its canonical label. Based on this lexicographic order gSpan adopts the depth-first search strategy to mine frequent connected subgraphs efficiently. Our performance study shows that gSpan substantially outperforms previous algorithms, sometimes by an order of magnitude.

2,282 citations

Journal ArticleDOI
TL;DR: Efficient algorithms for the discovery of frequent itemsets, which forms the compute-intensive phase of the association mining task, are presented, and the effect of using different database layout schemes combined with the proposed decomposition and traversal techniques is also presented.
Abstract: Association rule discovery has emerged as an important problem in knowledge discovery and data mining. The association mining task consists of identifying the frequent itemsets, and then forming conditional implication rules among them. We present efficient algorithms for the discovery of frequent itemsets which forms the compute intensive phase of the task. The algorithms utilize the structural properties of frequent itemsets to facilitate fast discovery. The items are organized into a subset lattice search space, which is decomposed into small independent chunks or sublattices, which can be solved in memory. Efficient lattice traversal techniques are presented which quickly identify all the long frequent itemsets and their subsets if required. We also present the effect of using different database layout schemes combined with the proposed decomposition and traversal techniques. We experimentally compare the new algorithms against the previous approaches, obtaining improvements of more than an order of magnitude for our test databases.
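The abstract refers to database layout schemes and to decomposing the subset lattice into independent sublattices without giving details. One way to picture this, assuming a vertical layout in which each item maps to the set of transaction identifiers containing it, is the depth-first sketch below: the support of an itemset is the size of the intersection of its items' identifier sets, and each prefix defines a sublattice that is explored independently. The function names and data are hypothetical, and the code is not the authors' algorithm.

```python
def vertical_layout(transactions):
    """Map each item to the set of ids of the transactions that contain it."""
    tidsets = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets


def mine_vertical(prefix, items, minsup, out):
    """Depth-first search; `items` is a list of (item, tidset) pairs extending `prefix`."""
    for i, (item, tids) in enumerate(items):
        if len(tids) < minsup:
            continue
        itemset = prefix + (item,)
        out[itemset] = len(tids)
        # Sublattice rooted at `itemset`: intersect with the tidsets of later items.
        suffix = [(other, tids & other_tids) for other, other_tids in items[i + 1:]]
        mine_vertical(itemset, suffix, minsup, out)
    return out


# Hypothetical transactions; ids are assigned by position in the list.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]
tidsets = vertical_layout(transactions)
frequent = mine_vertical((), sorted(tidsets.items()), 2, {})
```

Because each recursive call only needs the identifier sets passed to it, the sublattice under one prefix can be processed without touching the rest of the database.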

1,637 citations