Journal ArticleDOI

Levelwise Search and Borders of Theories in Knowledge Discovery

31 Jan 1997-Data Mining and Knowledge Discovery (Kluwer Academic Publishers)-Vol. 1, Iss: 3, pp 241-258
TL;DR: The concept of the border of a theory, a notion that turns out to be surprisingly powerful in analyzing the algorithm, is introduced and strong connections between the verification problem and the hypergraph transversal problem are shown.
Abstract: One of the basic problems in knowledge discovery in databases (KDD) is the following: given a data set r, a class L of sentences for defining subgroups of r, and a selection predicate, find all sentences of L deemed interesting by the selection predicate. We analyze the simple levelwise algorithm for finding all such descriptions. We give bounds for the number of database accesses that the algorithm makes. For this, we introduce the concept of the border of a theory, a notion that turns out to be surprisingly powerful in analyzing the algorithm. We also consider the verification problem of a KDD process: given r and a set of sentences S ⊆ L, determine whether S is exactly the set of interesting statements about r. We show strong connections between the verification problem and the hypergraph transversal problem. The verification problem arises in a natural way when using sampling to speed up the pattern discovery step in KDD.
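The levelwise strategy analyzed in the paper can be illustrated with its best-known instance, frequent-itemset mining, where the selection predicate is a minimum-support threshold. The following minimal Python sketch (function and variable names are illustrative, not from the paper) shows the key property: candidates of size k+1 are generated only from frequent sets of size k, so each level costs one pass over the data.

```python
from itertools import combinations

def levelwise_frequent_sets(transactions, min_support):
    """Generic levelwise (Apriori-style) search: evaluate candidates
    level by level, pruning any candidate with an infrequent subset.
    Illustrative sketch, not an optimized implementation."""
    items = sorted({i for t in transactions for i in t})

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    frequent = {}
    level = [frozenset([i]) for i in items]
    k = 1
    while level:
        # one "database pass": evaluate all candidates of size k
        survivors = [c for c in level if support(c) >= min_support]
        frequent.update({c: support(c) for c in survivors})
        # candidate generation: join survivors, then prune any
        # candidate with a size-k subset that is not frequent
        joined = {a | b for a in survivors for b in survivors
                  if len(a | b) == k + 1}
        level = [c for c in joined
                 if all(frozenset(s) in frequent
                        for s in combinations(c, k))]
        k += 1
    return frequent
```

For example, with transactions `[{a,b,c}, {a,b}, {a,c}, {b,c}]` and a support threshold of 2, all singletons and pairs are frequent but `{a,b,c}` (support 1) is pruned at level 3.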


Citations
Journal ArticleDOI
TL;DR: A new algorithm is presented to extract a process model from a so-called "workflow log" containing information about the workflow process as it is actually being executed and represent it in terms of a Petri net.
Abstract: Contemporary workflow management systems are driven by explicit process models, i.e., a completely specified workflow design is required in order to enact a given workflow process. Creating a workflow design is a complicated time-consuming process and, typically, there are discrepancies between the actual workflow processes and the processes as perceived by the management. Therefore, we have developed techniques for discovering workflow models. The starting point for such techniques is a so-called "workflow log" containing information about the workflow process as it is actually being executed. We present a new algorithm to extract a process model from such a log and represent it in terms of a Petri net. However, we also demonstrate that it is not possible to discover arbitrary workflow processes. We explore a class of workflow processes that can be discovered. We show that the /spl alpha/-algorithm can successfully mine any workflow represented by a so-called SWF-net.
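The first step of alpha-style workflow mining is extracting ordering relations from the log. A minimal sketch under simplified assumptions (each trace is a list of activity names; function names are illustrative):

```python
def directly_follows(log):
    """Directly-follows relation a > b: activity b appears immediately
    after activity a in at least one trace. This ordering relation is
    the raw material from which alpha-style miners derive a Petri net."""
    rel = set()
    for trace in log:
        for a, b in zip(trace, trace[1:]):
            rel.add((a, b))
    return rel

def causality(rel):
    """Causal relation a -> b: a directly precedes b but never the
    reverse; pairs that follow each other in both orders are treated
    as potentially parallel rather than causal."""
    return {(a, b) for (a, b) in rel if (b, a) not in rel}
```

On the two traces `[a, b, c]` and `[a, c, b]`, `b` and `c` follow each other in both orders, so only `a -> b` and `a -> c` are classified as causal.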

1,953 citations

Journal ArticleDOI
TL;DR: This work gives efficient algorithms for the discovery of all frequent episodes from a given class of episodes, and presents detailed experimental results that are in use in telecommunication alarm management.
Abstract: Sequences of events describing the behavior and actions of users or systems can be collected in several domains. An episode is a collection of events that occur relatively close to each other in a given partial order. We consider the problem of discovering frequently occurring episodes in a sequence. Once such episodes are known, one can produce rules for describing or predicting the behavior of the sequence. We give efficient algorithms for the discovery of all frequent episodes from a given class of episodes, and present detailed experimental results. The methods are in use in telecommunication alarm management.
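The episode-frequency idea can be sketched for the simplest case: a parallel episode (a set of event types with no ordering constraint) counted over fixed-width sliding windows. This is a simplified illustration, not the paper's exact algorithm:

```python
def episode_frequency(sequence, episode, window):
    """Fraction of sliding windows of the given width that contain
    every event type of a parallel episode at least once.
    Simplified sketch: windows are fully inside the sequence."""
    n = len(sequence)
    windows = n - window + 1
    hits = 0
    for start in range(windows):
        seen = set(sequence[start:start + window])
        if episode <= seen:  # all episode events occur in this window
            hits += 1
    return hits / windows
```

For the event sequence `a b c a b` with window width 2, the episode `{a, b}` occurs in two of the four windows, giving frequency 0.5.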

1,593 citations


Cites methods from "Levelwise Search and Borders of The..."

  • ...The levelwise main algorithm has also been used successfully in the search of frequent sets (Agrawal et al., 1996); a generic levelwise algorithm and its analysis has been presented in Mannila and Toivonen (1997) ....


Book ChapterDOI
10 Jan 1999
TL;DR: This paper proposes a new algorithm, called A-Close, using a closure mechanism to find frequent closed itemsets, and shows that this approach is very valuable for dense and/or correlated data that represent an important part of existing databases.
Abstract: In this paper, we address the problem of finding frequent itemsets in a database. Using the closed itemset lattice framework, we show that this problem can be reduced to the problem of finding frequent closed itemsets. Based on this statement, we can construct efficient data mining algorithms by limiting the search space to the closed itemset lattice rather than the subset lattice. Moreover, we show that the set of all frequent closed itemsets suffices to determine a reduced set of association rules, thus addressing another important data mining problem: limiting the number of rules produced without information loss. We propose a new algorithm, called A-Close, using a closure mechanism to find frequent closed itemsets. We realized experiments to compare our approach to the commonly used frequent itemset search approach. Those experiments showed that our approach is very valuable for dense and/or correlated data that represent an important part of existing databases.
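The closure mechanism underlying A-Close can be sketched with the standard Galois closure: the closure of an itemset is the intersection of all transactions containing it, and an itemset is closed iff it equals its closure. A minimal sketch (names illustrative):

```python
def closure(itemset, transactions):
    """Galois closure of an itemset: the intersection of all
    transactions that contain it. An itemset X is closed iff
    closure(X) == X; every itemset has the same support as its
    closure, which is why closed itemsets suffice."""
    covering = [t for t in transactions if itemset <= t]
    if not covering:
        return itemset  # no supporting transaction; treat as closed
    out = set(covering[0])
    for t in covering[1:]:
        out &= t
    return frozenset(out)
```

With transactions `{a,b,c}`, `{a,b}`, `{a,c}`, the closure of `{b}` is `{a,b}` (so `{b}` is not closed and has the same support as `{a,b}`), while `{a}` is its own closure.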

1,513 citations

Journal ArticleDOI
TL;DR: This paper proposes a projection-based, sequential pattern-growth approach for efficient mining of sequential patterns, and shows that PrefixSpan outperforms the Apriori-based algorithm GSP, FreeSpan, and SPADE and is the fastest among all the tested algorithms.
Abstract: Sequential pattern mining is an important data mining problem with broad applications. However, it is also a difficult problem since the mining may have to generate or examine a combinatorially explosive number of intermediate subsequences. Most of the previously developed sequential pattern mining methods, such as GSP, explore a candidate generation-and-test approach [R. Agrawal et al. (1994)] to reduce the number of candidates to be examined. However, this approach may not be efficient in mining large sequence databases having numerous patterns and/or long patterns. In this paper, we propose a projection-based, sequential pattern-growth approach for efficient mining of sequential patterns. In this approach, a sequence database is recursively projected into a set of smaller projected databases, and sequential patterns are grown in each projected database by exploring only locally frequent fragments. Based on an initial study of pattern growth-based sequential pattern mining, FreeSpan [J. Han et al. (2000)], we propose a more efficient method, called PrefixSpan, which offers ordered growth and reduced projected databases. To further improve the performance, a pseudoprojection technique is developed in PrefixSpan. A comprehensive performance study shows that PrefixSpan, in most cases, outperforms the Apriori-based algorithm GSP, FreeSpan, and SPADE [M. Zaki (2001)] (a sequential pattern mining algorithm that adopts a vertical data format), and PrefixSpan integrated with pseudoprojection is the fastest among all the tested algorithms. Furthermore, this mining methodology can be extended to mining sequential patterns with user-specified constraints. The high promise of the pattern-growth approach may lead to its further extension toward efficient mining of other kinds of frequent patterns, such as frequent substructures.
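The projection idea can be sketched for the simplest setting, sequences of single items: extend a prefix with each locally frequent item, then recurse into the projected (suffix) database. This is a simplified illustration of pattern growth, not the paper's optimized implementation:

```python
def project(database, prefix_item):
    """Project a sequence database onto a prefix item: for each
    sequence containing the item, keep the suffix after its first
    occurrence (the first occurrence suffices for subsequence support)."""
    projected = []
    for seq in database:
        if prefix_item in seq:
            suffix = seq[seq.index(prefix_item) + 1:]
            if suffix:
                projected.append(suffix)
    return projected

def prefixspan(database, min_support, prefix=()):
    """Recursive pattern growth: count locally frequent items, extend
    the prefix with each, and recurse into the projected database."""
    patterns = []
    counts = {}
    for seq in database:
        for item in set(seq):  # count each item once per sequence
            counts[item] = counts.get(item, 0) + 1
    for item, cnt in sorted(counts.items()):
        if cnt >= min_support:
            pat = prefix + (item,)
            patterns.append((pat, cnt))
            patterns.extend(
                prefixspan(project(database, item), min_support, pat))
    return patterns
```

On the database `[abc, ac, bc]` with support threshold 2, the search grows `(a)`, `(a, c)`, `(b)`, `(b, c)`, and `(c)`; each recursion only ever scans the ever-smaller projected databases.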

1,334 citations

References
Proceedings ArticleDOI
01 Jun 1993
TL;DR: An efficient algorithm is presented that generates all significant association rules between items in the database of customer transactions and incorporates buffer management and novel estimation and pruning techniques.
Abstract: We are given a large database of customer transactions. Each transaction consists of items purchased by a customer in a visit. We present an efficient algorithm that generates all significant association rules between items in the database. The algorithm incorporates buffer management and novel estimation and pruning techniques. We also present results of applying this algorithm to sales data obtained from a large retailing company, which show the effectiveness of the algorithm.
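Once frequent itemsets and their supports are known, rules are derived by splitting each itemset into an antecedent and a consequent and checking confidence = support(X ∪ {A}) / support(X). A minimal sketch for single-item consequents (names illustrative):

```python
def association_rules(frequent, min_conf):
    """Derive rules X => A from a dict mapping frequent itemsets to
    their supports: for each frequent itemset, try every single-item
    consequent and keep rules meeting the confidence threshold."""
    rules = []
    for itemset, supp in frequent.items():
        if len(itemset) < 2:
            continue
        for a in itemset:
            antecedent = itemset - {a}
            conf = supp / frequent[antecedent]
            if conf >= min_conf:
                rules.append((antecedent, a, conf))
    return rules
```

For example, if `{a}` has support 3 and both `{b}` and `{a,b}` have support 2, then `b => a` holds with confidence 1.0 while `a => b` only reaches 2/3.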

15,645 citations


"Levelwise Search and Borders of The..." refers background in this paper

  • ..., the set Th(L, r, q) = {φ ∈ L | q(r, φ) is true}. Example 1 Given a relation r with n rows over binary-valued attributes R, an association rule [1] is an expression of the form X ⇒ A, where X ⊆ R and A ∈ R....


  • ...Several algorithms for finding frequent sets have been presented [1, 2, 11, 14, 15, 16, 31, 35, 36, 37, 38]....


Book
01 Jan 1966

3,954 citations

Proceedings Article
01 Feb 1996

2,649 citations


"Levelwise Search and Borders of The..." refers background or methods in this paper

  • ...For example, in computations of frequent sets for association rules Step 5 uses only a negligible amount of time [2]....


  • ...See [2, 14, 15, 31, 36, 38] for various implementation methods....


  • ...Some straightforward lower bounds for the problem of finding all frequent sets are given in [2]....


  • ...Several algorithms for finding frequent sets have been presented [1, 2, 11, 14, 15, 16, 31, 35, 36, 37, 38]....


  • ...The approach has been used in various forms, for example in [2, 6, 7, 18, 20, 23]....


Book
01 Dec 1991
TL;DR: Knowledge Discovery in Databases brings together current research on the exciting problem of discovering useful and interesting knowledge in databases, which spans many different approaches to discovery, including inductive learning, Bayesian statistics, semantic query optimization, knowledge acquisition for expert systems, information theory, and fuzzy sets.
Abstract: From the Publisher: Knowledge Discovery in Databases brings together current research on the exciting problem of discovering useful and interesting knowledge in databases. It spans many different approaches to discovery, including inductive learning, Bayesian statistics, semantic query optimization, knowledge acquisition for expert systems, information theory, and fuzzy sets. The rapid growth in the number and size of databases creates a need for tools and techniques for intelligent data understanding. Relationships and patterns in data may enable a manufacturer to discover the cause of a persistent disk failure or the reason for consumer complaints. But today's databases hide their secrets beneath a cover of overwhelming detail. The task of uncovering these secrets is called "discovery in databases." This loosely defined subfield of machine learning is concerned with discovery from large amounts of possibly uncertain data. Its techniques range from statistics to the use of domain knowledge to control search. Following an overview of knowledge discovery in databases, thirty technical chapters are grouped in seven parts which cover discovery of quantitative laws, discovery of qualitative laws, using knowledge in discovery, data summarization, domain-specific discovery methods, integrated and multi-paradigm systems, and methodology and application issues. An important thread running through the collection is reliance on domain knowledge, starting with general methods and progressing to specialized methods where domain knowledge is built in. Gregory Piatetsky-Shapiro is Senior Member of Technical Staff and Principal Investigator of the Knowledge Discovery Project at GTE Laboratories. William Frawley is Principal Member of Technical Staff at GTE and Principal Investigator of the Learning in Expert Domains Project.

1,913 citations

Proceedings Article
11 Sep 1995
TL;DR: This paper presents an efficient algorithm for mining association rules that is fundamentally different from known algorithms and not only reduces the I/O overhead significantly but also has lower CPU overhead for most cases.
Abstract: Mining for association rules between items in a large database of sales transactions has been described as an important database mining problem. In this paper we present an efficient algorithm for mining association rules that is fundamentally different from known algorithms. Compared to previous algorithms, our algorithm not only reduces the I/O overhead significantly but also has lower CPU overhead for most cases. We have performed extensive experiments and compared the performance of our algorithm with one of the best existing algorithms. It was found that for large databases, the CPU overhead was reduced by as much as a factor of four and I/O was reduced by almost an order of magnitude. Hence this algorithm is especially suitable for very large databases.

1,822 citations