Journal ArticleDOI

Levelwise Search and Borders of Theories in Knowledge Discovery

31 Jan 1997-Data Mining and Knowledge Discovery (Kluwer Academic Publishers)-Vol. 1, Iss: 3, pp 241-258
TL;DR: The concept of the border of a theory, a notion that turns out to be surprisingly powerful in analyzing the algorithm, is introduced and strong connections between the verification problem and the hypergraph transversal problem are shown.
Abstract: One of the basic problems in knowledge discovery in databases (KDD) is the following: given a data set r, a class L of sentences for defining subgroups of r, and a selection predicate, find all sentences of L deemed interesting by the selection predicate. We analyze the simple levelwise algorithm for finding all such descriptions. We give bounds for the number of database accesses that the algorithm makes. For this, we introduce the concept of the border of a theory, a notion that turns out to be surprisingly powerful in analyzing the algorithm. We also consider the verification problem of a KDD process: given r and a set of sentences S ⊆ L, determine whether S is exactly the set of interesting statements about r. We show strong connections between the verification problem and the hypergraph transversal problem. The verification problem arises in a natural way when using sampling to speed up the pattern discovery step in KDD.
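The levelwise strategy analyzed in the paper can be illustrated with its best-known instance, frequent-itemset mining, where the selection predicate is a minimum-support threshold. The following minimal Python sketch (function and variable names are illustrative, not from the paper) shows the key property: candidates of size k+1 are generated only from frequent sets of size k, so each level costs one pass over the data.

```python
from itertools import combinations

def levelwise_frequent_sets(transactions, min_support):
    """Generic levelwise (Apriori-style) search: evaluate candidates
    level by level, pruning any candidate with an infrequent subset.
    Illustrative sketch, not an optimized implementation."""
    items = sorted({i for t in transactions for i in t})

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    frequent = {}
    level = [frozenset([i]) for i in items]
    k = 1
    while level:
        # one "database pass": evaluate all candidates of size k
        survivors = [c for c in level if support(c) >= min_support]
        frequent.update({c: support(c) for c in survivors})
        # candidate generation: join survivors, then prune any
        # candidate with a size-k subset that is not frequent
        joined = {a | b for a in survivors for b in survivors
                  if len(a | b) == k + 1}
        level = [c for c in joined
                 if all(frozenset(s) in frequent
                        for s in combinations(c, k))]
        k += 1
    return frequent
```

For example, with transactions `[{a,b,c}, {a,b}, {a,c}, {b,c}]` and a support threshold of 2, all singletons and pairs are frequent but `{a,b,c}` (support 1) is pruned at level 3.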


Citations
Journal ArticleDOI
TL;DR: A new algorithm is presented to extract a process model from a so-called "workflow log" containing information about the workflow process as it is actually being executed and represent it in terms of a Petri net.
Abstract: Contemporary workflow management systems are driven by explicit process models, i.e., a completely specified workflow design is required in order to enact a given workflow process. Creating a workflow design is a complicated time-consuming process and, typically, there are discrepancies between the actual workflow processes and the processes as perceived by the management. Therefore, we have developed techniques for discovering workflow models. The starting point for such techniques is a so-called "workflow log" containing information about the workflow process as it is actually being executed. We present a new algorithm to extract a process model from such a log and represent it in terms of a Petri net. However, we also demonstrate that it is not possible to discover arbitrary workflow processes. We explore a class of workflow processes that can be discovered. We show that the /spl alpha/-algorithm can successfully mine any workflow represented by a so-called SWF-net.
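The first step of alpha-style workflow mining is extracting ordering relations from the log. A minimal sketch under simplified assumptions (each trace is a list of activity names; function names are illustrative):

```python
def directly_follows(log):
    """Directly-follows relation a > b: activity b appears immediately
    after activity a in at least one trace. This ordering relation is
    the raw material from which alpha-style miners derive a Petri net."""
    rel = set()
    for trace in log:
        for a, b in zip(trace, trace[1:]):
            rel.add((a, b))
    return rel

def causality(rel):
    """Causal relation a -> b: a directly precedes b but never the
    reverse; pairs that follow each other in both orders are treated
    as potentially parallel rather than causal."""
    return {(a, b) for (a, b) in rel if (b, a) not in rel}
```

On the two traces `[a, b, c]` and `[a, c, b]`, `b` and `c` follow each other in both orders, so only `a -> b` and `a -> c` are classified as causal.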

1,953 citations

Journal ArticleDOI
TL;DR: This work gives efficient algorithms for the discovery of all frequent episodes from a given class of episodes, and presents detailed experimental results that are in use in telecommunication alarm management.
Abstract: Sequences of events describing the behavior and actions of users or systems can be collected in several domains. An episode is a collection of events that occur relatively close to each other in a given partial order. We consider the problem of discovering frequently occurring episodes in a sequence. Once such episodes are known, one can produce rules for describing or predicting the behavior of the sequence. We give efficient algorithms for the discovery of all frequent episodes from a given class of episodes, and present detailed experimental results. The methods are in use in telecommunication alarm management.
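The episode-frequency idea can be sketched for the simplest case: a parallel episode (a set of event types with no ordering constraint) counted over fixed-width sliding windows. This is a simplified illustration, not the paper's exact algorithm:

```python
def episode_frequency(sequence, episode, window):
    """Fraction of sliding windows of the given width that contain
    every event type of a parallel episode at least once.
    Simplified sketch: windows are fully inside the sequence."""
    n = len(sequence)
    windows = n - window + 1
    hits = 0
    for start in range(windows):
        seen = set(sequence[start:start + window])
        if episode <= seen:  # all episode events occur in this window
            hits += 1
    return hits / windows
```

For the event sequence `a b c a b` with window width 2, the episode `{a, b}` occurs in two of the four windows, giving frequency 0.5.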

1,593 citations


Cites methods from "Levelwise Search and Borders of The..."

  • ...The levelwise main algorithm has also been used successfully in the search of frequent sets (Agrawal et al., 1996); a generic levelwise algorithm and its analysis has been presented in Mannila and Toivonen (1997) ....


Book ChapterDOI
10 Jan 1999
TL;DR: This paper proposes a new algorithm, called A-Close, using a closure mechanism to find frequent closed itemsets, and shows that this approach is very valuable for dense and/or correlated data that represent an important part of existing databases.
Abstract: In this paper, we address the problem of finding frequent itemsets in a database. Using the closed itemset lattice framework, we show that this problem can be reduced to the problem of finding frequent closed itemsets. Based on this statement, we can construct efficient data mining algorithms by limiting the search space to the closed itemset lattice rather than the subset lattice. Moreover, we show that the set of all frequent closed itemsets suffices to determine a reduced set of association rules, thus addressing another important data mining problem: limiting the number of rules produced without information loss. We propose a new algorithm, called A-Close, using a closure mechanism to find frequent closed itemsets. We realized experiments to compare our approach to the commonly used frequent itemset search approach. Those experiments showed that our approach is very valuable for dense and/or correlated data that represent an important part of existing databases.
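The closure mechanism underlying A-Close can be sketched with the standard Galois closure: the closure of an itemset is the intersection of all transactions containing it, and an itemset is closed iff it equals its closure. A minimal sketch (names illustrative):

```python
def closure(itemset, transactions):
    """Galois closure of an itemset: the intersection of all
    transactions that contain it. An itemset X is closed iff
    closure(X) == X; every itemset has the same support as its
    closure, which is why closed itemsets suffice."""
    covering = [t for t in transactions if itemset <= t]
    if not covering:
        return itemset  # no supporting transaction; treat as closed
    out = set(covering[0])
    for t in covering[1:]:
        out &= t
    return frozenset(out)
```

With transactions `{a,b,c}`, `{a,b}`, `{a,c}`, the closure of `{b}` is `{a,b}` (so `{b}` is not closed and has the same support as `{a,b}`), while `{a}` is its own closure.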

1,513 citations

Journal ArticleDOI
TL;DR: This paper proposes a projection-based, sequential pattern-growth approach for efficient mining of sequential patterns, and shows that PrefixSpan outperforms the Apriori-based algorithm GSP, FreeSpan, and SPADE and is the fastest among all the tested algorithms.
Abstract: Sequential pattern mining is an important data mining problem with broad applications. However, it is also a difficult problem since the mining may have to generate or examine a combinatorially explosive number of intermediate subsequences. Most of the previously developed sequential pattern mining methods, such as GSP, explore a candidate generation-and-test approach [R. Agrawal et al. (1994)] to reduce the number of candidates to be examined. However, this approach may not be efficient in mining large sequence databases having numerous patterns and/or long patterns. In this paper, we propose a projection-based, sequential pattern-growth approach for efficient mining of sequential patterns. In this approach, a sequence database is recursively projected into a set of smaller projected databases, and sequential patterns are grown in each projected database by exploring only locally frequent fragments. Based on an initial study of pattern growth-based sequential pattern mining, FreeSpan [J. Han et al. (2000)], we propose a more efficient method, called PrefixSpan, which offers ordered growth and reduced projected databases. To further improve the performance, a pseudoprojection technique is developed in PrefixSpan. A comprehensive performance study shows that PrefixSpan, in most cases, outperforms the Apriori-based algorithm GSP, FreeSpan, and SPADE [M. Zaki (2001)] (a sequential pattern mining algorithm that adopts a vertical data format), and PrefixSpan integrated with pseudoprojection is the fastest among all the tested algorithms. Furthermore, this mining methodology can be extended to mining sequential patterns with user-specified constraints. The high promise of the pattern-growth approach may lead to its further extension toward efficient mining of other kinds of frequent patterns, such as frequent substructures.
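The projection idea can be sketched for the simplest setting, sequences of single items: extend a prefix with each locally frequent item, then recurse into the projected (suffix) database. This is a simplified illustration of pattern growth, not the paper's optimized implementation:

```python
def project(database, prefix_item):
    """Project a sequence database onto a prefix item: for each
    sequence containing the item, keep the suffix after its first
    occurrence (the first occurrence suffices for subsequence support)."""
    projected = []
    for seq in database:
        if prefix_item in seq:
            suffix = seq[seq.index(prefix_item) + 1:]
            if suffix:
                projected.append(suffix)
    return projected

def prefixspan(database, min_support, prefix=()):
    """Recursive pattern growth: count locally frequent items, extend
    the prefix with each, and recurse into the projected database."""
    patterns = []
    counts = {}
    for seq in database:
        for item in set(seq):  # count each item once per sequence
            counts[item] = counts.get(item, 0) + 1
    for item, cnt in sorted(counts.items()):
        if cnt >= min_support:
            pat = prefix + (item,)
            patterns.append((pat, cnt))
            patterns.extend(
                prefixspan(project(database, item), min_support, pat))
    return patterns
```

On the database `[abc, ac, bc]` with support threshold 2, the search grows `(a)`, `(a, c)`, `(b)`, `(b, c)`, and `(c)`; each recursion only ever scans the ever-smaller projected databases.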

1,334 citations

References
Proceedings ArticleDOI
01 Jun 1993
TL;DR: An efficient algorithm is presented that generates all significant association rules between items in the database of customer transactions and incorporates buffer management and novel estimation and pruning techniques.
Abstract: We are given a large database of customer transactions. Each transaction consists of items purchased by a customer in a visit. We present an efficient algorithm that generates all significant association rules between items in the database. The algorithm incorporates buffer management and novel estimation and pruning techniques. We also present results of applying this algorithm to sales data obtained from a large retailing company, which show the effectiveness of the algorithm.
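Once frequent itemsets and their supports are known, rules are derived by splitting each itemset into an antecedent and a consequent and checking confidence = support(X ∪ {A}) / support(X). A minimal sketch for single-item consequents (names illustrative):

```python
def association_rules(frequent, min_conf):
    """Derive rules X => A from a dict mapping frequent itemsets to
    their supports: for each frequent itemset, try every single-item
    consequent and keep rules meeting the confidence threshold."""
    rules = []
    for itemset, supp in frequent.items():
        if len(itemset) < 2:
            continue
        for a in itemset:
            antecedent = itemset - {a}
            conf = supp / frequent[antecedent]
            if conf >= min_conf:
                rules.append((antecedent, a, conf))
    return rules
```

For example, if `{a}` has support 3 and both `{b}` and `{a,b}` have support 2, then `b => a` holds with confidence 1.0 while `a => b` only reaches 2/3.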

15,645 citations


"Levelwise Search and Borders of The..." refers background in this paper

  • ..., the set Th(L, r, q) = {φ ∈ L | q(r, φ) is true}. Example 1 Given a relation r with n rows over binary-valued attributes R, an association rule [1] is an expression of the form X ⇒ A, where X ⊆ R and A ∈ R....


  • ...Several algorithms for finding frequent sets have been presented [1, 2, 11, 14, 15, 16, 31, 35, 36, 37, 38]....


Book
01 Jan 1966

3,954 citations

Proceedings Article
01 Feb 1996

2,649 citations


"Levelwise Search and Borders of The..." refers background or methods in this paper

  • ...For example, in computations of frequent sets for association rules Step 5 uses only a negligible amount of time [2]....


  • ...See [2, 14, 15, 31, 36, 38] for various implementation methods....


  • ...Some straightforward lower bounds for the problem of finding all frequent sets are given in [2]....


  • ...Several algorithms for finding frequent sets have been presented [1, 2, 11, 14, 15, 16, 31, 35, 36, 37, 38]....


  • ...The approach has been used in various forms, for example in [2, 6, 7, 18, 20, 23]....


Book
01 Dec 1991
TL;DR: Knowledge Discovery in Databases brings together current research on the exciting problem of discovering useful and interesting knowledge in databases, which spans many different approaches to discovery, including inductive learning, Bayesian statistics, semantic query optimization, knowledge acquisition for expert systems, information theory, and fuzzy sets.
Abstract: From the Publisher: Knowledge Discovery in Databases brings together current research on the exciting problem of discovering useful and interesting knowledge in databases. It spans many different approaches to discovery, including inductive learning, Bayesian statistics, semantic query optimization, knowledge acquisition for expert systems, information theory, and fuzzy sets. The rapid growth in the number and size of databases creates a need for tools and techniques for intelligent data understanding. Relationships and patterns in data may enable a manufacturer to discover the cause of a persistent disk failure or the reason for consumer complaints. But today's databases hide their secrets beneath a cover of overwhelming detail. The task of uncovering these secrets is called "discovery in databases." This loosely defined subfield of machine learning is concerned with discovery from large amounts of possibly uncertain data. Its techniques range from statistics to the use of domain knowledge to control search. Following an overview of knowledge discovery in databases, thirty technical chapters are grouped in seven parts which cover discovery of quantitative laws, discovery of qualitative laws, using knowledge in discovery, data summarization, domain-specific discovery methods, integrated and multi-paradigm systems, and methodology and application issues. An important thread running through the collection is reliance on domain knowledge, starting with general methods and progressing to specialized methods where domain knowledge is built in. Gregory Piatetsky-Shapiro is Senior Member of Technical Staff and Principal Investigator of the Knowledge Discovery Project at GTE Laboratories. William Frawley is Principal Member of Technical Staff at GTE and Principal Investigator of the Learning in Expert Domains Project.

1,913 citations

Proceedings Article
11 Sep 1995
TL;DR: This paper presents an efficient algorithm for mining association rules that is fundamentally different from known algorithms and not only reduces the I/O overhead significantly but also has lower CPU overhead for most cases.
Abstract: Mining for association rules between items in a large database of sales transactions has been described as an important database mining problem. In this paper we present an efficient algorithm for mining association rules that is fundamentally different from known algorithms. Compared to previous algorithms, our algorithm not only reduces the I/O overhead significantly but also has lower CPU overhead for most cases. We have performed extensive experiments and compared the performance of our algorithm with one of the best existing algorithms. It was found that for large databases, the CPU overhead was reduced by as much as a factor of four and I/O was reduced by almost an order of magnitude. Hence this algorithm is especially suitable for very large databases.

1,822 citations