
A Fast Distributed Algorithm for Mining Association Rules*

David W. Cheung†    Jiawei Han‡    Vincent T. Ng††    Ada W. Fu‡‡    Yongjian Fu‡

† Department of Computer Science, The University of Hong Kong, Hong Kong. Email: dcheung@cs.hku.hk.
‡ School of Computing Science, Simon Fraser University, Canada. Email: han@cs.sfu.ca.
†† Department of Computing, Hong Kong Polytechnic University, Hong Kong. Email: cstyng@comp.polyu.edu.hk.
‡‡ Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong. Email: adafu@cs.cuhk.hk.
Abstract

With the existence of many large transaction databases, the huge amounts of data, the high scalability of distributed systems, and the easy partition and distribution of a centralized database, it is important to investigate efficient methods for distributed mining of association rules. This study discloses some interesting relationships between locally large and globally large itemsets and proposes an interesting distributed association rule mining algorithm, FDM (Fast Distributed Mining of association rules), which generates a small number of candidate sets and substantially reduces the number of messages to be passed at mining association rules. Our performance study shows that FDM has a superior performance over the direct application of a typical sequential algorithm. Further performance enhancement leads to a few variations of the algorithm.
1 Introduction

An association rule is a rule which implies certain association relationships among a set of objects (such as "occur together" or "one implies the other") in a database. Since finding interesting association rules in databases may disclose some useful patterns for decision support, selective marketing, financial forecast, medical diagnosis, and many other applications, it has attracted a lot of attention in recent data mining research [5]. Mining association rules may require iterative scanning of large transaction or relational databases, which is quite costly in processing. Therefore, efficient mining of association rules in transaction and/or relational databases has been studied substantially [1, 2, 4, 8, 10, 11, 12, 14, 15].
*The research of the first author was supported in part by RGC (the Hong Kong Research Grants Council) grant 338/065/0026. The research of the second author was supported in part by the research grant NSERC-A3723 from the Natural Sciences and Engineering Research Council of Canada, the research grant NCE:IRIS/PRECARN-HMI5 from the Networks of Centres of Excellence of Canada, and a research grant from Hughes Research Laboratories.
Previous studies examined efficient mining of association rules from many different angles. An influential association rule mining algorithm, Apriori [2], has been developed for rule mining in large transaction databases. The DHP algorithm [10] is an extension of Apriori using a hashing technique. The scope of the study has also been extended to efficient mining of sequential patterns [3], generalized association rules [14], multiple-level association rules [8], quantitative association rules [15], etc. Maintenance of discovered association rules by incremental updating has been studied in [4]. Although these studies are on sequential data mining techniques, algorithms for parallel mining of association rules have been proposed recently [11, 1].
We feel that the development of distributed algorithms for efficient mining of association rules has its unique importance, based on the following reasoning. (1) Databases or data warehouses [13] may store a huge amount of data. Mining association rules in such databases may require substantial processing power, and a distributed system is a possible solution. (2) Many large databases are distributed in nature. For example, the huge number of transaction records of hundreds of Sears department stores are likely to be stored at different sites. This observation motivates us to study efficient distributed algorithms for mining association rules in databases. This study may also shed new light on parallel data mining. Furthermore, a distributed mining algorithm can also be used to mine association rules in a single large database by partitioning the database among a set of sites and processing the task in a distributed manner. The high flexibility, scalability, low cost-performance ratio, and easy connectivity of a distributed system make it an ideal platform for mining association rules.

In this study, we assume that the database to be studied is a transaction database, although the method can be easily extended to relational databases as well. The database consists of a huge number of transaction records, each with a transaction identifier (TID) and a set of data items. Further, we assume that the

database is "horizontally" partitioned (i.e., grouped by transactions) and allocated to the sites in a distributed system which communicate by message passing. Based on these assumptions, we examine distributed mining of association rules. It has been well known that the major cost of mining association rules is the computation of the set of large itemsets (i.e., frequently occurring sets of items, see Section 2.1) in the database [2]. Distributed computing of large itemsets encounters some new problems. One may compute locally large itemsets easily, but a locally large itemset may not be globally large. Since it is very expensive to broadcast the whole data set to other sites, one option is to broadcast all the counts of all the itemsets, no matter locally large or small, to other sites. However, a database may contain enormous combinations of itemsets, and it will involve passing a huge number of messages.
Based on our observation, there exist some interesting properties between locally large and globally large itemsets. One should maximally take advantage of such properties to reduce the number of messages to be passed and confine the substantial amount of processing to local sites. As mentioned before, two algorithms for parallel mining of association rules have been proposed. The two proposed algorithms, PDM and Count Distribution (CD), are designed for shared-nothing parallel systems [11, 1]. However, they can also be adapted to a distributed environment. We have proposed an efficient distributed data mining algorithm FDM (Fast Distributed Mining of association rules), which has the following distinct features in comparison with these two proposed parallel mining algorithms.
1. The generation of candidate sets is in the same spirit as Apriori. However, some interesting relationships between locally large sets and globally large ones are explored to generate a smaller set of candidate sets at each iteration and thus reduce the number of messages to be passed.

2. After the candidate sets have been generated, two pruning techniques, local pruning and global pruning, are developed to prune away some candidate sets at each individual site.

3. In order to determine whether a candidate set is large, our algorithm requires only O(n) messages for support count exchange, where n is the number of sites in the network. This is much less than a straight adaptation of Apriori, which requires O(n^2) messages.
Notice that several different combinations of the local and global prunings can be adopted in FDM. We studied three versions of FDM: FDM-LP, FDM-LUP, and FDM-LPP (see Section 4), with a similar framework but different combinations of pruning techniques. FDM-LP only explores the local pruning; FDM-LUP does both local pruning and the upper-bound-pruning; and FDM-LPP does both local pruning and the polling-site-pruning.

Extensive experiments have been conducted to study the performance of FDM and compare it against the Count Distribution algorithm. The study demonstrates the efficiency of the distributed mining algorithm.
The remainder of the paper is organized as follows. The tasks of mining association rules in sequential as well as distributed environments are defined in Section 2. In Section 3, techniques for distributed mining of association rules and some important results are discussed. The algorithms for different versions of FDM are presented in Section 4. A performance study is reported in Section 5. Our discussions and conclusions are presented respectively in Sections 6 and 7.
2 Problem Definition

2.1 Sequential Algorithm for Mining Association Rules
Let I = {i1, i2, ..., im} be a set of items. Let DB be a database of transactions, where each transaction T consists of a set of items such that T ⊆ I. Given an itemset X ⊆ I, a transaction T contains X if and only if X ⊆ T. An association rule is an implication of the form X ⇒ Y, where X ⊆ I, Y ⊆ I and X ∩ Y = ∅. The association rule X ⇒ Y holds in DB with confidence c if the probability that a transaction in DB which contains X also contains Y is c. The association rule X ⇒ Y has support s in DB if the probability that a transaction in DB contains both X and Y is s. The task of mining association rules is to find all the association rules whose support is larger than a minimum support threshold and whose confidence is larger than a minimum confidence threshold.

For an itemset X, its support is the percentage of transactions in DB which contain X, and its support count, denoted by X.sup, is the number of transactions in DB containing X. An itemset X is large (or more precisely, frequently occurring) if its support is no less than the minimum support threshold. An itemset of size k is called a k-itemset.
It has been shown that the problem of mining association rules can be reduced to two subproblems [2]: (1) find all large itemsets for a given minimum support threshold, and (2) generate the association rules from the large itemsets found. Since (1) dominates the overall cost of mining association rules, the research has been focused on how to develop efficient methods to solve the first subproblem [2].
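To make the first subproblem concrete, here is a minimal sketch (illustrative code, not from the paper) that computes support counts over a toy transaction database and keeps the size-2 itemsets whose support reaches an assumed minimum support threshold:

```python
from itertools import combinations

# Toy transaction database and an assumed minimum support threshold.
DB = [{"A", "B", "C"}, {"B", "C"}, {"A", "C"}, {"B", "D"}]
s = 0.5

def support_count(X, transactions):
    """X.sup: the number of transactions that contain every item of X."""
    return sum(1 for T in transactions if X <= T)

items = sorted(set().union(*DB))
large_2 = [set(c) for c in combinations(items, 2)
           if support_count(set(c), DB) >= s * len(DB)]
print(large_2)   # the large 2-itemsets, here AC and BC
```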
An interesting algorithm, Apriori [2], has been proposed for computing large itemsets at mining association rules in a transaction database. There have been many studies on mining association rules using sequential algorithms in centralized databases (e.g., [10, 14, 8, 12, 4, 15]), which can be viewed as variations or extensions to Apriori. For example, as an extension to Apriori, the DHP algorithm [10] uses a direct hashing technique to eliminate some size-2 candidate sets in the Apriori algorithm.
2.2 Distributed Algorithm for Mining Association Rules
We examine the mining of association rules in a distributed environment. Let DB be a database with D transactions. Assume that there are n sites S1, S2, ..., Sn in a distributed system and the database DB is partitioned over the n sites into {DB1, DB2, ..., DBn}, respectively. Let the size of the partition DBi be Di, for i = 1, ..., n. Let X.sup and X.supi be the support counts of an itemset X in DB and DBi, respectively. X.sup is called the global support count, and X.supi the local support count of X at site Si. For a given minimum support threshold s, X is globally large if X.sup ≥ s × D; correspondingly, X is locally large at site Si if X.supi ≥ s × Di. In the following, L denotes the globally large itemsets in DB, and L(k) the globally large k-itemsets in L. The essential task of a distributed association rule mining algorithm is to find the globally large itemsets L.
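As a small numeric illustration of these definitions (the values are assumed for illustration, not taken from the paper), the global support count is the sum of the local support counts over the partitions, and the two largeness conditions are simple threshold checks:

```python
# Hypothetical local support counts of an itemset X at n = 3 sites,
# and the sizes Di of the corresponding partitions DBi.
X_sup_i = [10, 10, 2]      # X.supi for i = 1..3 (illustrative values)
D_i = [50, 50, 50]         # |DBi| for i = 1..3
s = 0.10                   # minimum support threshold

X_sup = sum(X_sup_i)       # global support count: X.sup = sum of the X.supi
D = sum(D_i)               # |DB| = sum of the |DBi|

globally_large = X_sup >= s * D
locally_large = [sup >= s * d for sup, d in zip(X_sup_i, D_i)]
print(globally_large, locally_large)   # True, [True, True, False]
```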
For comparison, we outline the Count Distribution (CD) algorithm as follows [1]. The algorithm is an adaptation of the Apriori algorithm to the distributed case. At each iteration, CD generates the candidate sets at every site by applying the Apriori_gen function on the set of large itemsets found at the previous iteration. Every site then computes the local support counts of all these candidate sets and broadcasts them to all the other sites. Subsequently, all the sites can find the globally large itemsets for that iteration, and then proceed to the next iteration.
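The per-iteration flow of CD can be summarized by the following single-process sketch (the broadcast of counts is simulated by summing over the partitions directly; `partitions`, `CA_k`, and all names are illustrative assumptions, not the algorithm's actual implementation):

```python
def cd_iteration(partitions, CA_k, s, D):
    """One Count Distribution iteration: CA_k is the candidate set
    Apriori_gen(L(k-1)), computed identically at every site; each site counts
    every candidate over its own partition DBi, the counts are exchanged,
    and the globally large k-itemsets are those with total count >= s * D."""
    local_counts = [
        {X: sum(1 for T in DBi if X <= T) for X in CA_k} for DBi in partitions
    ]
    # Count exchange: after the broadcast, every site holds the global counts.
    global_counts = {X: sum(counts[X] for counts in local_counts) for X in CA_k}
    return {X for X, total in global_counts.items() if total >= s * D}
```

Note that every site counts and exchanges all of CA(k), which is what FDM's pruning techniques below aim to avoid.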
3 Techniques for Distributed Data Mining

3.1 Generation of Candidate Sets
It is important to observe some interesting properties related to large itemsets in distributed environments, since such properties may substantially reduce the number of messages to be passed across the network at mining association rules.

There is an important relationship between large itemsets and the sites in a distributed database: every globally large itemset must be locally large at some site(s). If an itemset X is both globally large and locally large at a site Si, X is called gl-large at site Si. The set of gl-large itemsets at a site will form a basis for the site to generate its own candidate sets.
Two monotonic properties can be easily observed from the locally large and gl-large itemsets. First, if an itemset X is locally large at a site Si, then all of its subsets are also locally large at site Si. Secondly, if an itemset X is gl-large at a site Si, then all of its subsets are also gl-large at site Si. Notice that a similar relationship exists among the large itemsets in the centralized case. Following is an important result based on which an effective technique for candidate set generation in the distributed case is developed.
Lemma 1 If an itemset X is globally large, there exists a site Si (1 ≤ i ≤ n) such that X and all its subsets are gl-large at site Si.
Proof. If X is not locally large at any site, then X.supi < s × Di for all i = 1, ..., n. Therefore, X.sup < s × D, and X cannot be globally large. By contradiction, X must be locally large at some site Si, and hence X is gl-large at Si. Consequently, all the subsets of X must also be gl-large at Si. □
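Spelled out, the counting step of the proof uses only the additivity of the partition sizes and support counts over the partitions (D = D1 + ... + Dn and X.sup = X.sup1 + ... + X.supn), so the local inequalities add up to the global one:

\[
X.\mathrm{sup} \;=\; \sum_{i=1}^{n} X.\mathrm{sup}_i \;<\; \sum_{i=1}^{n} s \times D_i \;=\; s \times \sum_{i=1}^{n} D_i \;=\; s \times D .
\]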
We use GLi to denote the set of gl-large itemsets at site Si, and GLi(k) to denote the set of gl-large k-itemsets at site Si. It follows from Lemma 1 that if X ∈ L(k), then there exists a site Si such that all its size-(k-1) subsets are gl-large at site Si, i.e., they belong to GLi(k-1).
In a straightforward adaptation of Apriori, the set of candidate sets at the k-th iteration, denoted by CA(k) (which stands for the size-k candidate sets from Apriori), would be generated by applying the Apriori_gen function on L(k-1). That is,

CA(k) = Apriori_gen(L(k-1)).

At each site Si, let CGi(k) be the set of candidate sets generated by applying Apriori_gen on GLi(k-1), i.e.,

CGi(k) = Apriori_gen(GLi(k-1)),

where CG stands for candidate sets generated from gl-large itemsets. Hence CGi(k) is generated from GLi(k-1). Since GLi(k-1) ⊆ L(k-1), CGi(k) is a subset of CA(k). In the following, we use CG(k) to denote the set CG1(k) ∪ ··· ∪ CGn(k).
Theorem 1 For every k > 1, the set of all large k-itemsets L(k) is a subset of CG(k) = CG1(k) ∪ ··· ∪ CGn(k), where CGi(k) = Apriori_gen(GLi(k-1)).
Proof. Let X ∈ L(k). It follows from Lemma 1 that there exists a site Si (1 ≤ i ≤ n) such that all the size-(k-1) subsets of X are gl-large at site Si. Hence X ∈ CGi(k). Therefore,

L(k) ⊆ CG(k) = CG1(k) ∪ ··· ∪ CGn(k) = Apriori_gen(GL1(k-1)) ∪ ··· ∪ Apriori_gen(GLn(k-1)). □
Theorem 1 indicates that CG(k), which is a subset of CA(k) and could be much smaller than CA(k), can be taken as the set of candidate sets for the size-k large itemsets. The difference between the two sets, CA(k) and CG(k), depends on the distribution of the itemsets. This theorem forms a basis for the generation of the set of candidate sets in the algorithm FDM. First the set of candidate sets CGi(k) can be generated locally at each site Si at the k-th iteration. After the exchange of support counts, the gl-large itemsets GLi(k) in CGi(k) can be found at the end of that iteration. Based on GLi(k), the candidate sets at Si for the (k+1)-st iteration can then be generated. According to the performance study in Section 5, by using this approach, the number of candidate sets generated can be substantially reduced to about 10-25% of that generated in CD.
Example 1 illustrates the effectiveness of the reduction of candidate sets using Theorem 1.
Example 1 Assume there are 3 sites in a system which partitions the DB into DB1, DB2 and DB3. Suppose the set of large 1-itemsets (computed at the first iteration) is L(1) = {A, B, C, D, E, F, G, H}, in which A, B, and C are locally large at site S1; B, C, and D are locally large at site S2; and E, F, G, and H are locally large at site S3. Therefore, GL1(1) = {A, B, C}, GL2(1) = {B, C, D}, and GL3(1) = {E, F, G, H}. Based on Theorem 1, the set of size-2 candidate sets at site S1 is CG1(2), where CG1(2) = Apriori_gen(GL1(1)) = {AB, BC, AC}. Similarly, CG2(2) = {BC, CD, BD}, and CG3(2) = {EF, EG, EH, FG, FH, GH}. Hence, the set of candidate sets for large 2-itemsets is CG(2) = CG1(2) ∪ CG2(2) ∪ CG3(2), a total of 11 candidates. However, if Apriori_gen is applied to L(1), the set of candidate sets CA(2) = Apriori_gen(L(1)) would have 28 candidates. This shows that it is very effective to apply Theorem 1 to reduce the candidate sets. □
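As a quick check of Example 1, the following sketch (illustrative code, not from the paper) implements Apriori_gen as "every size-(k-1) subset must be large" and reproduces the 11-versus-28 candidate counts:

```python
from itertools import combinations

def apriori_gen(prev_large, k):
    """Candidates of size k whose every (k-1)-subset is in prev_large
    (a set of frozensets); equivalent to Apriori's join-and-prune step."""
    items = sorted(set().union(*prev_large))
    return {frozenset(c) for c in combinations(items, k)
            if all(frozenset(sub) in prev_large
                   for sub in combinations(c, k - 1))}

L1 = {frozenset(x) for x in "ABCDEFGH"}
GL1 = {frozenset(x) for x in "ABC"}
GL2 = {frozenset(x) for x in "BCD"}
GL3 = {frozenset(x) for x in "EFGH"}

CG2 = apriori_gen(GL1, 2) | apriori_gen(GL2, 2) | apriori_gen(GL3, 2)
CA2 = apriori_gen(L1, 2)
print(len(CG2), len(CA2))   # 11 28
```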
3.2 Local Pruning of Candidate Sets
The previous subsection shows that, based on Theorem 1, one can usually generate in a distributed environment a much smaller set of candidate sets than the direct application of the Apriori algorithm.

When the set of candidate sets CG(k) is generated, to find the globally large itemsets, the support counts of the candidate sets must be exchanged among all the sites. Notice that some candidate sets in CG(k) can be pruned by a local pruning technique before count exchange starts. The general idea is that at each site Si, if a candidate set X ∈ CGi(k) is not locally large at site Si, there is no need for Si to find out its global support count to determine whether it is globally large. This is because in this case, either X is small (not globally large), or it will be locally large at some other site, and hence only the site(s) at which X is locally large need to be responsible for finding the global support count of X. Therefore, in order to compute all the large k-itemsets, at each site Si, the candidate sets can be confined to only the sets X ∈ CGi(k) which are locally large at site Si. For convenience, we use LLi(k) to denote those candidate sets in CGi(k) which are locally large at site Si. Based on the above discussion, at every iteration (the k-th iteration), the gl-large k-itemsets can be computed at each site Si according to the following procedure.
1. Candidate sets generation: Generate the candidate sets CGi(k) based on the gl-large itemsets found at site Si at the (k-1)-st iteration using the formula CGi(k) = Apriori_gen(GLi(k-1)).

2. Local pruning: For each X ∈ CGi(k), scan the partition DBi to compute the local support count X.supi. If X is not locally large at site Si, it is excluded from the candidate sets LLi(k). (Note: This pruning only removes X from the candidate set at site Si. X could still be a candidate set at some other site.)

3. Support count exchange: Broadcast the candidate sets in LLi(k) to other sites to collect support counts. Compute their global support counts and find all the gl-large k-itemsets in site Si.

4. Broadcast mining results: Broadcast the computed gl-large k-itemsets to all the other sites.
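A compact, single-process sketch of this per-site iteration is given below (communication is abstracted away: `collect_counts` and `broadcast` are hypothetical stand-ins for the message-passing layer, and the Apriori_gen sketch is the same one shown after Example 1; none of this is the paper's code):

```python
from itertools import combinations

def apriori_gen(prev_large, k):
    """Candidates of size k whose every (k-1)-subset is in prev_large."""
    items = sorted(set().union(*prev_large))
    return {frozenset(c) for c in combinations(items, k)
            if all(frozenset(sub) in prev_large
                   for sub in combinations(c, k - 1))}

def fdm_iteration_at_site(DBi, GLi_prev, k, s, Di, D, collect_counts, broadcast):
    """One FDM iteration at site Si with local pruning.
    collect_counts(X) -> local support counts of X at the other sites;
    broadcast(sets)   -> send the gl-large k-itemsets found at Si."""
    # 1. Candidate set generation from the gl-large (k-1)-itemsets at Si.
    CGi_k = apriori_gen(GLi_prev, k)
    # 2. Local pruning: keep only the candidates locally large at Si.
    local_count = {X: sum(1 for T in DBi if X <= T) for X in CGi_k}
    LLi_k = {X for X in CGi_k if local_count[X] >= s * Di}
    # 3. Support count exchange, restricted to the surviving candidates.
    global_count = {X: local_count[X] + sum(collect_counts(X)) for X in LLi_k}
    GLi_k = {X for X in LLi_k if global_count[X] >= s * D}
    # 4. Broadcast the gl-large k-itemsets found at this site.
    broadcast(GLi_k)
    return GLi_k
```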
For clarity, the notations used so far are listed in Table 1.

D         number of transactions in DB
s         support threshold (minsup)
L(k)      globally large k-itemsets
CA(k)     candidate sets generated from L(k-1)
X.sup     global support count of X
Di        number of transactions in DBi
GLi(k)    gl-large k-itemsets at Si
CGi(k)    candidate sets generated from GLi(k-1)
LLi(k)    locally large k-itemsets in CGi(k)
X.supi    local support count of X at Si

Table 1: Notation Table.
To illustrate the above procedure, we continue working on Example 1 as follows.
Example 2 Assume the database in Example 1 contains 150 transactions and each one of the 3 partitions has 50 transactions. Also assume that the support threshold s = 10%. Moreover, according to Example 1, at the second iteration, the candidate sets generated at site S1 are CG1(2) = {AB, BC, AC}; at site S2, CG2(2) = {BC, BD, CD}; and at site S3, CG3(2) = {EF, EG, EH, FG, FH, GH}.
In order to compute the large 2-itemsets, the local support counts at each site are computed first. The result is recorded in Table 2.

Table 2: Locally Large Itemsets.
From Table 2, it can be seen that AC.sup1 = 2 < s × D1 = 5, so AC is not locally large. Hence, the candidate set AC is pruned away at site S1. On the other hand, both AB and BC have enough local support counts and they survive the local pruning. Hence LL1(2) = {AB, BC}. Similarly, LL2(2) = {BC, CD}, and LL3(2) = {EF, GH}. After the local pruning, the number of size-2 candidate sets has been reduced to five, which is less than half of the original size. Once the local pruning is completed, each site broadcasts messages containing all the remaining candidate sets to the other sites to collect their support counts. The result of this support count exchange is recorded in Table 3.
large candidates    request from
AB                  S1
BC                  S1, S2
CD                  S2
EF                  S3
GH                  S3

Table 3: Globally Large Itemsets.
The request for the support count of AB is broadcasted from S1 to sites S2 and S3, and the counts sent back are recorded at site S1 as in the second row of Table 3. The other rows record similar count exchange activities at the other sites. At the end of the iteration, site S1 finds out that only BC is gl-large, because BC.sup = 22 > s × D = 15, and AB.sup = 13 < s × D = 15. Hence the gl-large 2-itemset at site S1 is GL1(2) = {BC}. Similarly, GL2(2) = {BC, CD} and GL3(2) = {EF}. After the broadcast of the gl-large itemsets, all sites return the large 2-itemsets L(2) = {BC, CD, EF}.
Notice that some candidate sets, such as BC in this example, could be locally large at more than one site. In this case, the messages are broadcasted from all the sites at which BC is found to be locally large. This is unnecessary because, for each candidate itemset, only one broadcast is needed. In Section 3.4, an optimization technique to eliminate such redundancy will be discussed. □
There is a subtlety in the implementation of the four steps outlined above for finding globally large itemsets. In order to support both step 2, "local pruning", and step 3, "support count exchange", each site Si must have two sets of support counts. For local pruning, Si has to find the local support counts of its candidate sets CGi(k). For support count exchange, Si has to find the local support counts of some possibly different candidate sets from other sites in order to answer the count requests from these sites. A simple approach would be to scan DBi twice, once for collection of the counts for the local CGi(k), and once for responding to the count requests from other sites. However, this would substantially degrade the performance.

In fact, there is no need for two scans. At Si, not only is CGi(k) available at the beginning of the k-th iteration, but so are the other sets CGj(k) (j = 1, ..., n, j ≠ i), because all the GLi(k-1) (i = 1, ..., n) are broadcasted to every site at the end of the (k-1)-st iteration, and the sets of candidate sets CGi(k) (i = 1, ..., n) are computed from the corresponding GLi(k-1). That is, at the beginning of each iteration, since all the gl-large itemsets found at the previous iteration have been broadcasted to all the sites, every site can compute the candidate sets of every other site. Therefore, the local support counts of all these candidate sets can be found in one scan and stored in a data structure like the hash-tree used in Apriori [2]. Using this technique, the data structure can be built in one scan, and the two different sets of support counts required in the local pruning and support count exchange can be retrieved from this data structure.
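A sketch of this one-scan idea (illustrative only; a plain dictionary stands in for the hash-tree, and the CGj(k) of all sites are assumed to have been computed locally from the broadcasted GLj(k-1)):

```python
def count_all_candidates_in_one_scan(DBi, CG_all_sites):
    """Count, in a single scan of DBi, the local support of every candidate
    generated at any site, i.e. the union of CGj(k) for j = 1..n.
    CG_all_sites is a list of sets of frozensets, one per site."""
    union_CG = set().union(*CG_all_sites)
    counts = {X: 0 for X in union_CG}          # stands in for the hash-tree
    for T in DBi:                              # one scan of the local partition
        for X in union_CG:
            if X <= T:
                counts[X] += 1
    return counts   # serves both local pruning and remote count requests
```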
3.3 Global Pruning of Candidate Sets
The local pruning at a site Si uses only the local support counts found in DBi to prune a candidate set. In fact, the local support counts from other sites can also be used for pruning. A global pruning technique is developed to facilitate such pruning and is outlined as follows. At the end of each iteration, all the local support counts and the global support count of a candidate set X are available. These local support counts can be broadcasted together with the global support counts after a candidate set is found to be globally large. Using this information, some global pruning can be performed on the candidate sets at the subsequent iteration.

Assume that the local support count of every candidate itemset is broadcasted to all the sites after it is found to be globally large at the end of an iteration.

References

Fast algorithms for mining association rules.
Mining sequential patterns.
PVM: Parallel Virtual Machine: A Users' Guide and Tutorial for Networked Parallel Computing.
Efficient and Effective Clustering Methods for Spatial Data Mining.
An Efficient Algorithm for Mining Association Rules in Large Databases.