Communication Efficient Construction of Decision Trees Over Heterogeneously Distributed Data

Chris Giannella, Kun Liu, Todd Olsen, Hillol Kargupta
Department of Computer Science and Electrical Engineering
University of Maryland Baltimore County, Baltimore, MD 21250 USA
{cgiannel,kunliu1,tolsen1,hillol}@cs.umbc.edu
(H. Kargupta is also affiliated with AGNIK, LLC, USA.)
Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM'04)
Abstract
We present an algorithm designed to efficiently construct a decision tree over heterogeneously distributed data without centralizing. We compare our algorithm against a standard centralized decision tree implementation in terms of accuracy as well as the communication complexity. Our experimental results show that by using only 20% of the communication cost necessary to centralize the data we can achieve trees with accuracy at least 80% of the trees produced by the centralized version.
Key words: Decision Trees, Distributed Data Mining,
Random Projection
1 Introduction
Much of the world's data is distributed over a multitude of systems connected by communications channels of varying capacity. In such an environment, efficient use of available communications resources can be very important for practical data mining algorithms. In this paper, we introduce an algorithm for constructing decision trees in a distributed environment where communications resources are limited and efficient use of the available resources is needed. At the heart of this approach is the use of random projections to estimate the dot product between two binary vectors and some message optimization techniques. Before defining the problem and discussing our approach, we briefly discuss distributed data mining to provide context.
1.1 Distributed Data Mining (DDM)
Overview: Bluntly put, DDM is data mining where the data and computation are spread over many independent sites. For some applications, the distributed setting is more natural than the centralized one because the data is inherently distributed. The bulk of DDM methods in the literature operate over an abstract architecture where each site has a private memory containing its own portion of the data. The sites can operate independently and communicate by message passing over an asynchronous network. Typically communication is a bottleneck. Since communication is assumed to be carried out exclusively by message passing, a primary goal of many methods in the literature is to minimize the number of messages sent. Similarly, our goal is to minimize the number of messages sent. For more information about DDM, the reader is referred to two recent surveys [8], [10]. These provide a broad overview of DDM touching on issues such as: association rule mining, clustering, basic statistics computation, Bayesian network learning, classification, and the historical roots of DDM.
Data format: It is commonly assumed in the DDM literature that each site stores its data in tables. Due to the ubiquitous nature of relational databases, this assumption covers a lot of ground. One of two additional assumptions is commonly made regarding how the data is distributed across sites: homogeneously (horizontally partitioned) or heterogeneously (vertically partitioned). Both assumptions adopt the conceptual viewpoint that the tables at each site are partitions of a single global table.[1] In the homogeneous case, the global table is horizontally partitioned. The tables at each site are subsets of the global table; they have exactly the same attributes. In the heterogeneous case, the table is vertically partitioned; each site contains a collection of columns (sites do not have the same attributes). However, each tuple at each site is assumed to contain a unique identifier to facilitate matching across sites (matched tuples contain the same identifier).

[1] It is not assumed that the global table has been or ever was physically realized.

Note that the definition of “heterogeneous” in our paper
differs from that used in other research fields such as the
Semantic Web and Data Integration. In particular we are
not addressing the problem of schema matching.
1.2 Problem Definition and Results Summary
We consider the problem of building a decision tree over heterogeneously distributed data. We assume that each site has the same number of tuples (records) and they are ordered to facilitate matching, i.e., the i-th tuple on each site matches. This assumption is equivalent to the commonly made assumptions regarding heterogeneously distributed data described earlier. We also assume that the i-th tuple on each site has the same class label. Our approach can be applied to an arbitrary number of sites, but for simplicity, we restrict ourselves to the case of only two parties: Adam and Betty. However, in Section 4.3 we describe the communication complexity for an arbitrary number of sites. At the end, Adam and Betty are to each have the decision tree in its entirety. Our primary objective is to minimize the number of messages transmitted.

One way to solve this problem is to transmit all of the data from Adam's site to Betty. She then applies a standard centralized decision tree builder and, finally, transmits the final tree back to Adam. We call this method the centralized approach (CA). While straightforward, the CA may require excessive communication in low communication bandwidth environments. To address this problem, we have adapted a standard decision tree building algorithm to the heterogeneous environment. The main problem in doing so is computing the information gain offered by attributes in making splitting decisions. To reduce communication, we approximate information gain using a random projection based technique. The technique converges on the correct information gain as the number of messages transmitted increases. We call this approach to building a decision tree the distributed approach (DA).

The tree produced by DA may not be the same as that produced by CA. However, by increasing the number of messages transmitted, the DA tree can be made arbitrarily close. We conducted several experiments to measure the trade-off between accuracy and communication. Specifically, we built a tree using CA (with the standard Weka tree builder implementation) and others using DA while varying the number of messages used in information gain approximation and the depth of the tree. We observed that by using only 20% of the communication cost necessary to centralize the data we can achieve trees with accuracy at least 80% of the CA. Henceforth, when we discuss communication cost or communication complexity, we mean the total number of messages required. A message is a four-byte number, e.g., a standard floating-point number.
1.3 Paper Layout
In Section 2 we cite some related work. In Section 3 we describe the basic algorithm for building a decision tree over heterogeneously distributed data using a distributed dot product as the primary distributed operation. Then we propose a method for approximating a distributed dot product using a random projection. In Section 4 we describe the complete algorithm and give the communication complexity. In Section 5 we discuss how different message optimization techniques are employed to further reduce the communication. In Section 6 we present the results of our experiments. Finally, in Section 7 we describe several directions for future work and conclusions.
2 Related Work
Most algorithms for learning from homogeneously distributed data (horizontally partitioned) are directly related to ensemble learning [9, 3], meta-learning [12] and rule-based [5] combination techniques. In the heterogeneous case, each site observes only partial attributes (features) of the data set. Traditional ensemble-based approaches usually generate high variance local models and fail to detect the interaction between features observed at different sites. This makes the problem fundamentally challenging. The work addressed in [11] develops a framework to learn decision trees from heterogeneous data using a scalable evolutionary technique. In order to detect global patterns, they first make use of a boosting technique to identify a subset of the data that none of the local classifiers can classify with high confidence. This subset of the data is merged at the central site and a new classifier is constructed from it. When a combination of local classifiers cannot classify a new record with high confidence, the central classifier is used instead. This approach exhibits better accuracy than a simple aggregation of the models. However, its performance is sensitive to the confidence threshold. Furthermore, to reduce the complexity of the models, this algorithm applies a Fourier spectrum-based technique to aggregate all the local and central classifiers. However, the cost of computing the Fourier coefficients grows exponentially with the number of attributes. On the other hand, our algorithm generates a single decision tree for all the data sites and does not need to aggregate at all.

The work in [2] presents a general strategy of distributed decision tree learning by exchanging among different sites the indices and counts of the records that satisfy specified constraints on the values of particular attributes. The resulting algorithm is provably exact compared with the decision tree constructed on the centralized data. The communication complexity is given by O((M + |L|NV)ST), where M is the total number of records, |L| is the number of classes, N is the total number of attributes, V is the maximum number of possible values per attribute, S is the number of sites, and T is the number of nodes of the tree. However, instead of repeatedly sending whole index vectors to the other site, our algorithm applies a random projection-based strategy to compute distributed dot products as the building blocks of tree induction. This kind of dimension reduction technique, together with some other message reusing and message sharing schemes, eliminates as many unnecessary messages as possible. The number of messages for one dot product is bounded by O(k) (k ≪ M), and the total communication cost of our algorithm is O((L_T + k·I_T)(S − 1)), where L_T is the number of leaf nodes and I_T is the number of non-leaf nodes, which is less than that in [2].

The work presented in [4] deals with a privacy preserving two-party decision tree learning problem where no party is willing to divulge their data to the other. The basic tree induction procedure is similar to ours. However, a secure dot product protocol is proposed there as the building block such that only the information gain of the attribute being tested is disclosed to both parties and nothing else. The communication complexity of a single dot product protocol is O(4M), so the total communication cost is higher than ours.
3 Building a Distributed Decision Tree: the Basic Algorithm
For simplicity of exposition, we only discuss discrete data and assume that each node of the tree has a corresponding attribute and a child branch for each distinct value. Our algorithm, however, generalizes to other cases (e.g. continuous attributes) without any conceptual difficulties.
3.1 Notation
Both sites have M tuples ordered in such a way that tuple i on Adam's site corresponds to tuple i on Betty's site. Tuples on both sites have an associated class label drawn from a set L. The tuples are labeled consistently across sites, i.e., the i-th tuple on Adam's and Betty's sites has the same class label. Let N denote the total number of attributes from all sites.
Let $\mathcal{A}$ denote the union of attributes over both sites and $\mathcal{D}$ denote the data set formed by joining the data from both sites (Adam's i-th tuple is concatenated with Betty's to form the i-th tuple in $\mathcal{D}$). Given an attribute $A \in \mathcal{A}$, let $\Pi(A)$ denote the set of distinct values that appear in the A column. Given a set of attributes $X \subseteq \mathcal{A}$ and a list of values $x \in \times_{A \in X} \Pi(A)$, let $\mathcal{D}(X = x)$ denote the set of tuples t in $\mathcal{D}$ such that the X columns of t agree with x, i.e., for all $A \in X$, $t[A] = x[A]$. Given $\hat{\mathcal{D}} \subseteq \mathcal{D}$, an attribute $A \in \mathcal{A}$, and a value $a \in \Pi(A)$, let $\#_{A=a}(\hat{\mathcal{D}})$ denote the number of tuples t in $\hat{\mathcal{D}}$ such that $t[A] = a$.
[Figure 1. Calculating information gain using the dot product. ("Play" is the class name, and · denotes the dot product.)]
Given a class label $\ell \in L$, let $\#_\ell(\hat{\mathcal{D}})$ denote the number of tuples in $\hat{\mathcal{D}}$ with label $\ell$. Let $\#_{\ell,A=a}(\hat{\mathcal{D}})$ denote the number of tuples t in $\hat{\mathcal{D}}$ with $t[A] = a$ and label $\ell$. The class entropy of A over $\hat{\mathcal{D}}$ is denoted $E_A(\hat{\mathcal{D}})$ and defined as[2]

$$E_A(\hat{\mathcal{D}}) = -\sum_{a \in \Pi(A)} \frac{\#_{A=a}(\hat{\mathcal{D}})}{|\hat{\mathcal{D}}|} \sum_{\ell \in L} \frac{\#_{\ell,A=a}(\hat{\mathcal{D}})}{|\hat{\mathcal{D}}|} \log_2\!\left(\frac{\#_{\ell,A=a}(\hat{\mathcal{D}})}{|\hat{\mathcal{D}}|}\right).$$

The information gain of A over $\hat{\mathcal{D}}$ is denoted $G_A(\hat{\mathcal{D}})$ and defined as

$$G_A(\hat{\mathcal{D}}) = -\sum_{\ell \in L} \frac{\#_\ell(\hat{\mathcal{D}})}{|\hat{\mathcal{D}}|} \log_2\!\left(\frac{\#_\ell(\hat{\mathcal{D}})}{|\hat{\mathcal{D}}|}\right) - E_A(\hat{\mathcal{D}}).$$
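To make the role of these counts concrete, here is a minimal centralized sketch in Python (all names are illustrative) that computes entropy-based information gain in its standard conditional form; it only illustrates how the counts drive a splitting decision and may differ in normalization details from the exact definition above. It is not the distributed procedure described later.

    import math
    from collections import Counter

    def information_gain(rows, labels, attr):
        """rows: list of dicts (attribute name -> value); labels: class labels."""
        n = len(rows)

        def entropy(lbls):
            # -sum_l p_l * log2(p_l) over the class labels in lbls
            return -sum((c / len(lbls)) * math.log2(c / len(lbls))
                        for c in Counter(lbls).values())

        base = entropy(labels)                 # class entropy of the whole set
        split = 0.0                            # expected class entropy after splitting on attr
        for a in set(r[attr] for r in rows):   # a ranges over Pi(attr)
            part = [lab for r, lab in zip(rows, labels) if r[attr] == a]
            split += (len(part) / n) * entropy(part)
        return base - split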
Our distributed decision tree building approach can be applied without change to other forms of information gain such as the Gini index. For ease of discussion, we stick with entropy-based information gain.
3.2 Distributed Decision Tree Building Using a Dot Product
We adapt the following version of the standard, depth-first decision tree building algorithm (on discrete data). Initially the tree is empty and the first call is made to determine the root node. The call chooses the attribute $A_1$ from $\mathcal{A}$ with the largest information gain over $\mathcal{D}$ to become the root. For each $a \in \Pi(A_1)$, a recursive call is made with list $\{(A_1, a)\}$. Each of these recursive calls will determine the children of the root (with branches labeled with the values in $\Pi(A_1)$).

At any call passed the list $(A_1, a_1), \ldots, (A_k, a_k)$, the tuples in $\mathcal{D}(X = x)$, where $X = \{A_1, \ldots, A_k\}$ and $x = (a_1, \ldots, a_k)$, are examined to determine the next splitting attribute. The attribute from $\mathcal{A} - X$ with the largest information gain over $\mathcal{D}(X = x)$ is chosen.
[2] We assume $0 \log_2(0)$ equals zero.

Since the attributes are not all on one site, computing the information gain may not be possible. For example, assume at least one of the attributes from X is on Adam's site, and consider D, an attribute on Betty's site and not in X. To compute the information gain, Betty must compute $\#_\ell(\mathcal{D}(X = x))$ and $\#_{\ell,D=d}(\mathcal{D}(X = x))$ for all $d \in \Pi(D)$ and $\ell \in L$. These values cannot be computed directly since Betty does not have $\mathcal{D}(X = x)$. Adam must send Betty information to carry out this computation. To reduce the number of messages, we approximate the values using a technique based on random projections.
Each of the values can be modeled as a dot product computation (similar to [2] and [4]). Let $X_A$ denote the attributes from X on Adam's site and $x_A$ their associated values from the passed list; likewise define $X_B$ and $x_B$. Let $V(X_A = x_A)$ be a length-M vector of zeros and ones. The i-th entry is one if the i-th tuple $t_i$ in $\mathcal{D}$ satisfies $t_i[X_A] = x_A$. All other entries are zero. Likewise, let $V(D = d, X_B = x_B, \ell)$ be the 0/1 vector whose i-th entry is one if $t_i[D] = d$, $t_i[X_B] = x_B$, and $t_i$ has label $\ell$. It can easily be seen that the dot product of $V(X_A = x_A)$ and $V(D = d, X_B = x_B, \ell)$ equals $\#_{\ell,D=d}(\mathcal{D}(X = x))$. Moreover, the dot product of $V(X_A = x_A)$ and $V(X_B = x_B, \ell)$ equals $\#_\ell(\mathcal{D}(X = x))$.
Figure 1 illustrates this concept. Adam sends Betty a binary vector representing the tuples with "Outlook = Rainy". Betty constructs two vectors representing "Humidity = Normal && Play = Yes" and "Humidity = High && Play = Yes", respectively. The dot products give the number of tuples in the whole database that satisfy the constraints "Outlook = Rainy && Humidity = Normal && Play = Yes" and "Outlook = Rainy && Humidity = High && Play = Yes". Note that the notation above deals with the case where Betty computes the information gain of her attributes. However, our algorithm will also require the reverse case: Adam computes the information gain of all his attributes. The notation is analogous. Actually, in our algorithm, instead of sending the original binary vectors directly to the other site, we first project the vectors into a lower-dimensional space and then transmit the new vectors to all other sites. This leads to the distributed dot product computation in the next section.
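As a small illustration of this counting scheme (toy data in the spirit of Figure 1; the values are illustrative), the sketch below builds the 0/1 indicator vectors and recovers the counts as exact dot products, before any projection is applied.

    import numpy as np

    # Illustrative data: Adam holds Outlook; Betty holds Humidity and the class Play.
    outlook  = ["Rainy", "Rainy", "Rainy", "Rainy", "Sunny"]
    humidity = ["Normal", "Normal", "High", "High", "High"]
    play     = ["No", "Yes", "Yes", "Yes", "No"]

    # Adam's indicator vector V(Outlook = Rainy)
    v_adam = np.array([1 if o == "Rainy" else 0 for o in outlook])

    # Betty's indicator vectors V(Humidity = d, Play = Yes), one per value d
    for d in ("Normal", "High"):
        v_betty = np.array([1 if (h == d and p == "Yes") else 0
                            for h, p in zip(humidity, play)])
        # exact count of tuples with Outlook = Rainy, Humidity = d, Play = Yes
        print(d, int(v_adam @ v_betty))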
3.3 Distributed Dot Product
In the previous section, we observed that the distributed dot product of boolean vectors is the building block of decision tree induction. In this section, we propose a random projection-based distributed dot product technique that can greatly reduce the dimensionality of the vectors, thereby reducing the cost of building the tree. A similar form of this algorithm appears elsewhere in a different context [7].
Given vectors $a = (a_1, \ldots, a_m)^T$ and $b = (b_1, \ldots, b_m)^T$ at two distributed sites A and B, respectively, we want to approximate $a^T b$ using a small number of messages between A and B. Algorithm 3.3.1 gives the detailed procedure.
Algorithm 3.3.1 Distributed Dot Product Algorithm(a, b)
1. A sends B a random number generator seed. [1 message]
2. A and B cooperatively generate a $k \times m$ random matrix R, where $k \ll m$. Each entry is generated independently and identically from any distribution with zero mean and unit variance. A and B compute $\hat{a} = Ra$ and $\hat{b} = Rb$, respectively.
3. A sends $\hat{a}$ to B. B computes $\hat{a}^T \hat{b} = a^T R^T R b$. [k messages]
4. B computes $D = \hat{a}^T \hat{b} / k$.
So instead of sending an m-dimensional vector to the other site, we only need to send a k-dimensional vector, where $k \ll m$, and the dot product can still be estimated.
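The following is a minimal Python sketch of Algorithm 3.3.1; it simulates both sites in one process (in a real deployment only the seed and the k projected values would cross the network), and the ±1 entry distribution and all names are illustrative choices consistent with step 2.

    import numpy as np

    def distributed_dot_product(a, b, k, seed=0):
        """a, b: length-m vectors held by sites A and B; returns an estimate of a.b."""
        m = len(a)
        # Steps 1-2: both sites derive the same k x m matrix R from a shared seed;
        # entries are i.i.d. with zero mean and unit variance (here +/-1, an allowed choice).
        rng = np.random.default_rng(seed)
        R = rng.choice([-1.0, 1.0], size=(k, m))
        a_hat = R @ a                   # computed at site A and sent to B (k messages)
        b_hat = R @ b                   # computed locally at site B
        # Steps 3-4: B estimates a.b as (a_hat . b_hat) / k
        return float(a_hat @ b_hat) / k

    # Example: compare the estimate with the exact dot product.
    rng = np.random.default_rng(1)
    a = rng.integers(0, 2, size=10000)
    b = rng.integers(0, 2, size=10000)
    print(a @ b, distributed_dot_product(a, b, k=1000))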
The above algorithm is based on the following fact:

Lemma 3.1 Let R be a $p \times q$ dimensional random matrix such that each entry $r_{i,j}$ of R is chosen independently from some distribution with zero mean and unit variance. Then,
$$E[R^T R] = pI \quad \text{and} \quad E[RR^T] = qI.$$
Proof Sketch: The (i, j) entry of $R^T R$ is the dot product of the i-th and j-th columns of R. If i = j, then the expected value of this dot product equals p times (the variance plus the square of the mean), hence p. If i ≠ j, then the expected value of the dot product equals p times the square of the mean, hence zero. The second part of the lemma is proven analogously.

Intuitively, this result echoes the observation made elsewhere [6] that in a high-dimensional space, vectors with random directions are almost orthogonal. A similar result was proved elsewhere [1].
3.4 Accuracy Analysis
We give a Chernoff-like bound to quantify the accuracy of our distributed dot product for decision tree induction as follows:

Lemma 3.2 Let $a$ and $b$ be any two boolean vectors. Let $\hat{a}$ and $\hat{b}$ be the projections of $a$ and $b$ to $\mathbb{R}^k$ through a random matrix R whose entries are identically and independently chosen from N(0,1), such that $\hat{a} = Ra$ and $\hat{b} = Rb$. Then for any $\epsilon > 0$, we have
$$\Pr\left\{a^T b - \epsilon m \le \frac{\hat{a}^T \hat{b}}{k} \le a^T b + \epsilon m\right\} \ge 1 - 3\left[\left((1+\epsilon)e^{-\epsilon}\right)^{k/2} + \left((1-\epsilon)e^{\epsilon}\right)^{k/2}\right].$$

k Mean Var Min Max
100(1%) 0.1483 0.0098 0.0042 0.3837
500(5%) 0.0795 0.0035 0.0067 0.2686
1000(10%) 0.0430 0.0008 0.0033 0.1357
2000(20%) 0.0299 0.0007 0.0012 0.0902
3000(30%) 0.0262 0.0005 0.0002 0.0732
Table 1. Relative errors in computing the dot
product.
Proof: Omitted due to space constraints.
This bound shows that the error goes to 0 exponentially fast as k increases. Note that although the lemma is based on a normal distribution with zero mean and unit variance, it also holds for other distributions that are symmetric about the origin with unit variance. Table 1 depicts the relative error of the distributed dot product between two synthetically generated binary vectors of size 10000; k is the number of randomized iterations (represented as a percentage of the size of the original vectors). Each entry of the random matrix is chosen independently and uniformly from {−1, 1}. In practice, this bound can be used to find a suitable k.
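In the spirit of Table 1, the rough sketch below shows how a suitable k might be chosen empirically by measuring relative errors for several values of k (the data, trial count, and numbers produced are illustrative, not the authors' experiment).

    import numpy as np

    rng = np.random.default_rng(0)
    m = 10000
    a = rng.integers(0, 2, size=m)      # synthetic binary vectors
    b = rng.integers(0, 2, size=m)
    exact = a @ b

    for k in (100, 500, 1000, 2000, 3000):
        errs = []
        for _ in range(10):             # a few independent projections per k
            R = rng.choice([-1.0, 1.0], size=(k, m))
            est = (R @ a) @ (R @ b) / k
            errs.append(abs(est - exact) / exact)
        print(k, float(np.mean(errs)), float(np.max(errs)))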
4 Algorithm Details
4.1 Main Procedure
At the commencement of the algorithm, each site determines which local attribute offers the largest information gain. No communication is required to accomplish this. The best attribute from each site is then compared, and the attribute with the globally largest information gain, $A_G$, is selected to define the split at the root node of the tree. For each distinct value $a \in \Pi(A_G)$, a new branch leading down from the root is created. For each of these branches, the site containing $A_G$ constructs a binary vector representing which tuples correspond to this new branch, $V(A_G = a)$, and sends its projection to the other site. Upon receiving each vector, the other site indexes it according to its path and stores it in a vector cache for later use.
At each non-root node Z, each party P attempts to find the nearest ancestor of Z that splits on an attribute not local to P (one may not exist). Consider Figure 2 with P = Adam. When considering node Z1, path (1), the nearest non-local ancestor would be the grandparent of Z1. For node Z2, path (2), the nearest non-local ancestor would be the parent of Z2. For node Z3 no non-local ancestor exists.

If P fails to find a non-local ancestor for Z (i.e., the search terminated at the root), then P does not require any information from the other party to compute the information gain of its attributes. In this case the evaluation of the information gain proceeds as it does at the root and can be calculated exactly. Otherwise, P retrieves the appropriate entry from its vector cache and uses it to approximate the information gain of its local attributes using the distributed dot product. Note that, in either case, no communication is required.
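The ancestor search just described can be sketched as follows (a hypothetical Node object with parent and split_attr fields; an illustration of the lookup, not the authors' implementation).

    def nearest_nonlocal_ancestor(node, local_attrs):
        # Walk up from node toward the root and return the closest ancestor
        # whose splitting attribute is not held locally; None if all are local.
        anc = node.parent
        while anc is not None:
            if anc.split_attr not in local_attrs:
                return anc
            anc = anc.parent
        return None      # search reached the root: information gain can be computed exactly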
As before, once each site determines the local attribute with the largest information gain, the attribute with the globally largest information gain, $A_G$, is selected to define the split at Z. Following this, each party now executes one of the following actions.

If $A_G$ is local to P then, for each $a \in \Pi(A_G)$, a new branch leading down from Z is created. For each branch, P constructs a binary vector representing which tuples correspond to this new path and sends its projection to the other party.

If $A_G$ is non-local, then P waits until it receives the projection vector from the other party, indexes it according to its respective path, and stores it in the local vector cache.

The total number of messages required by the above actions is k (the number of rows of R).
In order to reduce the memory signature of the algorithm, each site will occasionally check the contents of its vector cache and delete any invalid entries. A vector becomes invalid when (1) every path associated with that vector terminates in a leaf node, or (2) the node which generated the vector is no longer the nearest non-local ancestor of any of its descendants.

We made one minor change to the algorithm presented above. When the number of ones/zeros in a binary vector is less than the number of iterations k, we can just transmit the list of indices directly, as sketched below. Not only does this reduce the communication cost of the algorithm even further, it allows the calculation of information gain further down the tree to be made in an exact, rather than approximate, manner.
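A small sketch of this optimization follows (the message encoding and names are assumptions for illustration; the sparse complement of zeros could be handled analogously).

    import numpy as np

    def encode_indicator(v, R):
        # v: 0/1 indicator vector; R: shared k x m projection matrix.
        k = R.shape[0]
        ones = np.flatnonzero(v)
        if len(ones) < k:               # sparse: the exact index list is cheaper than k messages
            return ("indices", ones.tolist())
        return ("projection", (R @ v).tolist())   # otherwise send the k projected values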
4.2 Leaf Node Determination
The construction of a path down the decision tree continues until a leaf node is reached, which meets at least one of the following criteria: (1) All of the tuples for the node belong to one class; the node is then labeled by that class. (2) If any child of a node is empty, that child is labeled as a leaf representing the most frequent class in this node. (3) There are fewer than minNumObj (4 in our experiments) tuples for the node, regardless of class; the node is then labeled by the most frequent class of all the tuples in this node. Note that the calculations used to determine this may be approximations based on the distributed dot product.
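These stopping criteria can be sketched as a single check on the (possibly approximate) per-class counts of a node; minNumObj follows the text, everything else is illustrative.

    def leaf_label(class_counts, min_num_obj=4, parent_majority=None):
        # class_counts: dict mapping class label -> (possibly approximate) tuple count.
        total = sum(class_counts.values())
        if total == 0:                       # criterion (2): empty child
            return parent_majority
        majority = max(class_counts, key=class_counts.get)
        if class_counts[majority] == total:  # criterion (1): all tuples in one class
            return majority
        if total < min_num_obj:              # criterion (3): fewer than minNumObj tuples
            return majority
        return None                          # not a leaf; keep splitting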
From the above criteria, we can see that the determination of a leaf node can actually be made by its parent, since information gain computation enables the parent to get the

References (partial list)
An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization
Popular ensemble methods: an empirical study
Building decision tree classifier on private data
An algorithmic theory of learning: robust concepts and random projection
Data Mining: Next Generation Challenges and Future Directions