Communication Efficient Construction of Decision Trees Over Heterogeneously Distributed Data

Chris Giannella, Kun Liu, Todd Olsen, Hillol Kargupta
Department of Computer Science and Electrical Engineering
University of Maryland Baltimore County, Baltimore, MD 21250 USA
{cgiannel,kunliu1,tolsen1,hillol}@cs.umbc.edu
(H. Kargupta is also affiliated with AGNIK, LLC, USA.)
Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM'04)
Abstract
We present an algorithm designed to efficiently construct a decision tree over heterogeneously distributed data without centralizing. We compare our algorithm against a standard centralized decision tree implementation in terms of accuracy as well as the communication complexity. Our experimental results show that by using only 20% of the communication cost necessary to centralize the data we can achieve trees with accuracy at least 80% of the trees produced by the centralized version.
Key words: Decision Trees, Distributed Data Mining,
Random Projection
1 Introduction
Much of the world's data is distributed over a multitude of systems connected by communications channels of varying capacity. In such an environment, efficient use of available communications resources can be very important for practical data mining algorithms. In this paper, we introduce an algorithm for constructing decision trees in a distributed environment where communications resources are limited and efficient use of the available resources is needed. At the heart of this approach is the use of random projections to estimate the dot product between two binary vectors and some message optimization techniques. Before defining the problem and discussing our approach, we briefly discuss distributed data mining to provide context.
1.1 Distributed Data Mining (DDM)
Overview: Bluntly put, DDM is data mining where the data and computation are spread over many independent sites. For some applications, the distributed setting is more natural than the centralized one because the data is inherently distributed. The bulk of DDM methods in the literature operate over an abstract architecture where each site has a private memory containing its own portion of the data. The sites can operate independently and communicate by message passing over an asynchronous network. Typically communication is a bottleneck. Since communication is assumed to be carried out exclusively by message passing, a primary goal of many methods in the literature is to minimize the number of messages sent. Similarly, our goal is to minimize the number of messages sent. For more information about DDM, the reader is referred to two recent surveys [8], [10]. These provide a broad overview of DDM touching on issues such as: association rule mining, clustering, basic statistics computation, Bayesian network learning, classification, and the historical roots of DDM.
Data format: It is commonly assumed in the DDM literature that each site stores its data in tables. Due to the ubiquitous nature of relational databases, this assumption covers a lot of ground. One of two additional assumptions is commonly made regarding how the data is distributed across sites: homogeneously (horizontally partitioned) or heterogeneously (vertically partitioned). Both assumptions adopt the conceptual viewpoint that the tables at each site are partitions of a single global table.[1] In the homogeneous case, the global table is horizontally partitioned. The tables at each site are subsets of the global table; they have exactly the same attributes. In the heterogeneous case, the table is vertically partitioned; each site contains a collection of columns (sites do not have the same attributes). However, each tuple at each site is assumed to contain a unique identifier to facilitate matching across sites (matched tuples contain the same identifier).

[1] It is not assumed that the global table has been or ever was physically realized.

Note that the definition of “heterogeneous” in our paper
differs from that used in other research fields such as the
Semantic Web and Data Integration. In particular we are
not addressing the problem of schema matching.
1.2 Problem Definition and Results Summary
We consider the problem of building a decision tree over heterogeneously distributed data. We assume that each site has the same number of tuples (records) and they are ordered to facilitate matching, i.e., the i-th tuple on each site matches. This assumption is equivalent to the commonly made assumptions regarding heterogeneously distributed data described earlier. We also assume that the i-th tuple on each site has the same class label. Our approach can be applied to an arbitrary number of sites, but for simplicity, we restrict ourselves to the case of only two parties: Adam and Betty. However, in Section 4.3 we describe the communication complexity for an arbitrary number of sites. At the end, Adam and Betty are to each have the decision tree in its entirety. Our primary objective is to minimize the number of messages transmitted.

One way to solve this problem is to transmit all of the data from Adam's site to Betty. She then applies a standard centralized decision tree builder and, finally, transmits the final tree back to Adam. We call this method the centralized approach (CA). While straightforward, the CA may require excessive communication in low communication bandwidth environments. To address this problem, we have adapted a standard decision tree building algorithm to the heterogeneous environment. The main problem in doing so is computing the information gain offered by attributes in making splitting decisions. To reduce communication, we approximate information gain using a random projection based technique. The technique converges on the correct information gain as the number of messages transmitted increases. We call this approach to building a decision tree the distributed approach (DA).

The tree produced by DA may not be the same as that produced by CA. However, by increasing the number of messages transmitted, the DA tree can be made arbitrarily close. We conducted several experiments to measure the trade-off between accuracy and communication. Specifically, we built a tree using CA (with the standard Weka tree builder implementation) and others using DA while varying the number of messages used in information gain approximation and the depth of the tree. We observed that by using only 20% of the communication cost necessary to centralize the data we can achieve trees with accuracy at least 80% of the CA. Henceforth, when we discuss communication cost or communication complexity, we mean the total number of messages required. A message is a four-byte number, e.g., a standard floating-point number.
1.3 Paper Layout
In Section 2 we cite some related work. In Section 3 we describe the basic algorithm for building a decision tree over heterogeneously distributed data using a distributed dot product as the primary distributed operation. Then we propose a method for approximating a distributed dot product using a random projection. In Section 4 we describe the complete algorithm and give the communication complexity. In Section 5 we discuss how different message optimization techniques are employed to further reduce the communication. In Section 6 we present the results of our experiments. Finally, in Section 7 we describe several directions for future work and conclusions.
2 Related Work
Most algorithms for learning from homogeneously distributed data (horizontally partitioned) are directly related to ensemble learning [9, 3], meta-learning [12] and rule-based [5] combination techniques. In the heterogeneous case, each site observes only partial attributes (features) of the data set. Traditional ensemble-based approaches usually generate high variance local models and fail to detect the interaction between features observed at different sites. This makes the problem fundamentally challenging. The work addressed in [11] develops a framework to learn decision trees from heterogeneous data using a scalable evolutionary technique. In order to detect global patterns, they first make use of a boosting technique to identify a subset of the data that none of the local classifiers can classify with high confidence. This subset of the data is merged at the central site and a new classifier is constructed from it. When a combination of local classifiers cannot classify a new record with high confidence, the central classifier is used instead. This approach exhibits better accuracy than a simple aggregation of the models. However, its performance is sensitive to the confidence threshold. Furthermore, to reduce the complexity of the models, this algorithm applies a Fourier spectrum-based technique to aggregate all the local and central classifiers. However, the cost of computing the Fourier coefficients grows exponentially with the number of attributes. On the other hand, our algorithm generates a single decision tree for all the data sites and does not need to aggregate at all.

The work in [2] presents a general strategy of distributed decision tree learning by exchanging among different sites the indices and counts of the records that satisfy specified constraints on the values of particular attributes. The resulting algorithm is provably exact compared with the decision tree constructed on the centralized data. The communication complexity is given by O((M + |L|NV)ST), where M is the total number of records, |L| is the number of classes, N is the total number of attributes, V is the maximum number of possible values per attribute, S is the number of sites, and T is the number of nodes of the tree. However, instead of repeatedly sending whole index vectors to the other site, our algorithm applies a random projection-based strategy to compute distributed dot products as the building blocks of tree induction. This kind of dimension reduction technique, together with some other message reusing and message sharing schemes, eliminates as many unnecessary messages as possible. The number of messages for one dot product is bounded by O(k) (k ≪ M), and the total communication cost of our algorithm is O((L_T + k·I_T)(S − 1)), where L_T is the number of leaf nodes and I_T is the number of non-leaf nodes, which is less than that in [2].

The work presented in [4] deals with a privacy preserving two-party decision tree learning problem where no party is willing to divulge their data to the other. The basic tree induction procedure is similar to ours. However, a secure dot product protocol is proposed there as the building block such that only the information gain of the attribute being tested is disclosed to both parties and nothing else. The communication complexity of a single dot product protocol is O(4M), so the total communication cost is higher than ours.
3 Building a Distributed Decision Tree: the Basic Algorithm
For simplicity of exposition, we only discuss discrete data and assume that each node of the tree has a corresponding attribute and a child branch for each distinct value. Our algorithm, however, generalizes to other cases (e.g. continuous attributes) without any conceptual difficulties.
3.1 Notation
Both sites have M tuples ordered in such a way that tuple i on Adam's site corresponds to tuple i on Betty's site. Tuples on both sites have an associated class label drawn from a set L. The tuples are labeled consistently across sites, i.e., the i-th tuple on Adam's and Betty's sites has the same class label. Let N denote the total number of attributes from all sites.
Let $\mathcal{A}$ denote the union of attributes over both sites and $\mathcal{D}$ denote the data set formed by joining the data from both sites (Adam's i-th tuple is concatenated with Betty's to form the i-th tuple in $\mathcal{D}$). Given an attribute $A \in \mathcal{A}$, let $\Pi(A)$ denote the set of distinct values that appear in the A column. Given a set of attributes $X \subseteq \mathcal{A}$ and a list of values $x \in \times_{A \in X} \Pi(A)$, let $\mathcal{D}(X = x)$ denote the set of tuples t in $\mathcal{D}$ such that the X columns of t agree with x, i.e., for all $A \in X$, $t[A] = x[A]$. Given $\hat{\mathcal{D}} \subseteq \mathcal{D}$, an attribute $A \in \mathcal{A}$, and a value $a \in \Pi(A)$, let $\#_{A=a}(\hat{\mathcal{D}})$ denote the number of tuples t in $\hat{\mathcal{D}}$ such that $t[A] = a$.
[Figure 1. Calculating information gain using the dot product. ("Play" is the class name, and · denotes the dot product.)]
Given a class label $\ell \in L$, let $\#_\ell(\hat{\mathcal{D}})$ denote the number of tuples in $\hat{\mathcal{D}}$ with label $\ell$. Let $\#_{\ell,A=a}(\hat{\mathcal{D}})$ denote the number of tuples t in $\hat{\mathcal{D}}$ with $t[A] = a$ and label $\ell$. The class entropy of A over $\hat{\mathcal{D}}$ is denoted $E_A(\hat{\mathcal{D}})$ and defined as[2]

$$E_A(\hat{\mathcal{D}}) = -\sum_{a \in \Pi(A)} \frac{\#_{A=a}(\hat{\mathcal{D}})}{|\hat{\mathcal{D}}|} \sum_{\ell \in L} \frac{\#_{\ell,A=a}(\hat{\mathcal{D}})}{|\hat{\mathcal{D}}|} \log_2\!\left(\frac{\#_{\ell,A=a}(\hat{\mathcal{D}})}{|\hat{\mathcal{D}}|}\right).$$

The information gain of A over $\hat{\mathcal{D}}$ is denoted $G_A(\hat{\mathcal{D}})$ and defined as

$$G_A(\hat{\mathcal{D}}) = -\sum_{\ell \in L} \frac{\#_\ell(\hat{\mathcal{D}})}{|\hat{\mathcal{D}}|} \log_2\!\left(\frac{\#_\ell(\hat{\mathcal{D}})}{|\hat{\mathcal{D}}|}\right) - E_A(\hat{\mathcal{D}}).$$
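To make the role of these counts concrete, here is a minimal centralized sketch in Python (all names are illustrative) that computes entropy-based information gain in its standard conditional form; it only illustrates how the counts drive a splitting decision and may differ in normalization details from the exact definition above. It is not the distributed procedure described later.

    import math
    from collections import Counter

    def information_gain(rows, labels, attr):
        """rows: list of dicts (attribute name -> value); labels: class labels."""
        n = len(rows)

        def entropy(lbls):
            # -sum_l p_l * log2(p_l) over the class labels in lbls
            return -sum((c / len(lbls)) * math.log2(c / len(lbls))
                        for c in Counter(lbls).values())

        base = entropy(labels)                 # class entropy of the whole set
        split = 0.0                            # expected class entropy after splitting on attr
        for a in set(r[attr] for r in rows):   # a ranges over Pi(attr)
            part = [lab for r, lab in zip(rows, labels) if r[attr] == a]
            split += (len(part) / n) * entropy(part)
        return base - split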
Our distributed decision tree building approach can be applied without change to other forms of information gain such as the Gini index. For ease of discussion, we stick with entropy-based information gain.
3.2 Distributed Decision Tree Building Using a Dot Product
We adapt the following version of the standard, depth-first decision tree building algorithm (on discrete data). Initially the tree is empty and the first call is made to determine the root node. The call chooses the attribute $A_1$ from $\mathcal{A}$ with the largest information gain over $\mathcal{D}$ to become the root. For each $a \in \Pi(A_1)$, a recursive call is made with list $\{(A_1, a)\}$. Each of these recursive calls will determine the children of the root (with branches labeled with the values in $\Pi(A_1)$).

At any call passed the list $(A_1, a_1), \ldots, (A_k, a_k)$, the tuples in $\mathcal{D}(X = x)$, where $X = \{A_1, \ldots, A_k\}$ and $x = (a_1, \ldots, a_k)$, are examined to determine the next splitting attribute. The attribute from $\mathcal{A} - X$ with the largest information gain over $\mathcal{D}(X = x)$ is chosen.
[2] We assume $0 \log_2(0)$ equals zero.

Since the attributes are not all on one site, computing the information gain may not be possible. For example, assume at least one of the attributes from X is on Adam's site, and consider D, an attribute on Betty's site and not in X. To compute the information gain, Betty must compute $\#_\ell(\mathcal{D}(X = x))$ and $\#_{\ell,D=d}(\mathcal{D}(X = x))$ for all $d \in \Pi(D)$ and $\ell \in L$. These values cannot be computed directly since Betty does not have $\mathcal{D}(X = x)$. Adam must send Betty information to carry out this computation. To reduce the number of messages, we approximate the values using a technique based on random projections.
Each of the values can be modeled as a dot product computation (similar to [2] and [4]). Let $X_A$ denote the attributes from X on Adam's site and $x_A$ their associated values from the passed list; likewise define $X_B$ and $x_B$. Let $V(X_A = x_A)$ be a length-M vector of zeros and ones. The i-th entry is one if the i-th tuple $t_i$ in $\mathcal{D}$ satisfies $t_i[X_A] = x_A$. All other entries are zero. Likewise, let $V(D = d, X_B = x_B, \ell)$ be the 0/1 vector whose i-th entry is one if $t_i[D] = d$, $t_i[X_B] = x_B$, and $t_i$ has label $\ell$. It can easily be seen that the dot product of $V(X_A = x_A)$ and $V(D = d, X_B = x_B, \ell)$ equals $\#_{\ell,D=d}(\mathcal{D}(X = x))$. Moreover, the dot product of $V(X_A = x_A)$ and $V(X_B = x_B, \ell)$ equals $\#_\ell(\mathcal{D}(X = x))$.
Figure 1 illustrates this concept. Adam sends Betty a binary vector representing the tuples with "Outlook = Rainy". Betty constructs two vectors representing "Humidity = Normal && Play = Yes" and "Humidity = High && Play = Yes", respectively. The dot products give the number of tuples in the whole database that satisfy the constraints "Outlook = Rainy && Humidity = Normal && Play = Yes" and "Outlook = Rainy && Humidity = High && Play = Yes". Note that the notation above deals with the case where Betty computes the information gain of her attributes. However, our algorithm will also require the reverse case: Adam computes the information gain of all his attributes. The notation is analogous. Actually, in our algorithm, instead of sending the original binary vectors directly to the other site, we first project the vectors into a lower-dimensional space and then transmit the new vectors to all other sites. This leads to the distributed dot product computation in the next section.
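As a small illustration of this counting scheme (toy data in the spirit of Figure 1; the values are illustrative), the sketch below builds the 0/1 indicator vectors and recovers the counts as exact dot products, before any projection is applied.

    import numpy as np

    # Illustrative data: Adam holds Outlook; Betty holds Humidity and the class Play.
    outlook  = ["Rainy", "Rainy", "Rainy", "Rainy", "Sunny"]
    humidity = ["Normal", "Normal", "High", "High", "High"]
    play     = ["No", "Yes", "Yes", "Yes", "No"]

    # Adam's indicator vector V(Outlook = Rainy)
    v_adam = np.array([1 if o == "Rainy" else 0 for o in outlook])

    # Betty's indicator vectors V(Humidity = d, Play = Yes), one per value d
    for d in ("Normal", "High"):
        v_betty = np.array([1 if (h == d and p == "Yes") else 0
                            for h, p in zip(humidity, play)])
        # exact count of tuples with Outlook = Rainy, Humidity = d, Play = Yes
        print(d, int(v_adam @ v_betty))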
3.3 Distributed Dot Product
In the previous section, we observed that the distributed dot product of boolean vectors is the building block of decision tree induction. In this section, we propose a random projection-based distributed dot product technique that can greatly reduce the dimensionality of the vectors, thereby reducing the cost of building the tree. A similar form of this algorithm appears elsewhere in a different context [7].
Given vectors $a = (a_1, \ldots, a_m)^T$ and $b = (b_1, \ldots, b_m)^T$ at two distributed sites A and B, respectively, we want to approximate $a^T b$ using a small number of messages between A and B. Algorithm 3.3.1 gives the detailed procedure.
Algorithm 3.3.1 Distributed Dot Product Algorithm(a, b)
1. A sends B a random number generator seed. [1 message]
2. A and B cooperatively generate a $k \times m$ random matrix R, where $k \ll m$. Each entry is generated independently and identically from any distribution with zero mean and unit variance. A and B compute $\hat{a} = Ra$ and $\hat{b} = Rb$, respectively.
3. A sends $\hat{a}$ to B. B computes $\hat{a}^T \hat{b} = a^T R^T R b$. [k messages]
4. B computes $D = \hat{a}^T \hat{b} / k$.
So instead of sending an m-dimensional vector to the other site, we only need to send a k-dimensional vector, where $k \ll m$, and the dot product can still be estimated.
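The following is a minimal Python sketch of Algorithm 3.3.1; it simulates both sites in one process (in a real deployment only the seed and the k projected values would cross the network), and the ±1 entry distribution and all names are illustrative choices consistent with step 2.

    import numpy as np

    def distributed_dot_product(a, b, k, seed=0):
        """a, b: length-m vectors held by sites A and B; returns an estimate of a.b."""
        m = len(a)
        # Steps 1-2: both sites derive the same k x m matrix R from a shared seed;
        # entries are i.i.d. with zero mean and unit variance (here +/-1, an allowed choice).
        rng = np.random.default_rng(seed)
        R = rng.choice([-1.0, 1.0], size=(k, m))
        a_hat = R @ a                   # computed at site A and sent to B (k messages)
        b_hat = R @ b                   # computed locally at site B
        # Steps 3-4: B estimates a.b as (a_hat . b_hat) / k
        return float(a_hat @ b_hat) / k

    # Example: compare the estimate with the exact dot product.
    rng = np.random.default_rng(1)
    a = rng.integers(0, 2, size=10000)
    b = rng.integers(0, 2, size=10000)
    print(a @ b, distributed_dot_product(a, b, k=1000))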
The above algorithm is based on the following fact:

Lemma 3.1 Let R be a $p \times q$ dimensional random matrix such that each entry $r_{i,j}$ of R is chosen independently from some distribution with zero mean and unit variance. Then,
$$E[R^T R] = pI \quad \text{and} \quad E[RR^T] = qI.$$
Proof Sketch: The (i, j) entry of $R^T R$ is the dot product of the i-th and j-th columns of R. If i = j, then the expected value of this dot product equals p times (the variance plus the square of the mean), hence p. If i ≠ j, then the expected value of the dot product equals p times the square of the mean, hence zero. The second part of the lemma is proven analogously.

Intuitively, this result echoes the observation made elsewhere [6] that in a high-dimensional space, vectors with random directions are almost orthogonal. A similar result was proved elsewhere [1].
3.4 Accuracy Analysis
We give a Chernoff-like bound to quantify the accuracy of our distributed dot product for decision tree induction as follows:

Lemma 3.2 Let $a$ and $b$ be any two boolean vectors. Let $\hat{a}$ and $\hat{b}$ be the projections of $a$ and $b$ to $\mathbb{R}^k$ through a random matrix R whose entries are identically and independently chosen from N(0,1), such that $\hat{a} = Ra$ and $\hat{b} = Rb$. Then for any $\epsilon > 0$, we have
$$\Pr\left\{a^T b - \epsilon m \le \frac{\hat{a}^T \hat{b}}{k} \le a^T b + \epsilon m\right\} \ge 1 - 3\left[\left((1+\epsilon)e^{-\epsilon}\right)^{k/2} + \left((1-\epsilon)e^{\epsilon}\right)^{k/2}\right].$$

k Mean Var Min Max
100(1%) 0.1483 0.0098 0.0042 0.3837
500(5%) 0.0795 0.0035 0.0067 0.2686
1000(10%) 0.0430 0.0008 0.0033 0.1357
2000(20%) 0.0299 0.0007 0.0012 0.0902
3000(30%) 0.0262 0.0005 0.0002 0.0732
Table 1. Relative errors in computing the dot
product.
Proof: Omitted due to space constraints.
This bound shows that the error goes to 0 exponentially fast as k increases. Note that although the lemma is based on a normal distribution with zero mean and unit variance, it also holds for other distributions that are symmetric about the origin with unit variance. Table 1 depicts the relative error of the distributed dot product between two synthetically generated binary vectors of size 10000; k is the number of randomized iterations (represented as a percentage of the size of the original vectors). Each entry of the random matrix is chosen independently and uniformly from {−1, 1}. In practice, this bound can be used to find a suitable k.
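In the spirit of Table 1, the rough sketch below shows how a suitable k might be chosen empirically by measuring relative errors for several values of k (the data, trial count, and numbers produced are illustrative, not the authors' experiment).

    import numpy as np

    rng = np.random.default_rng(0)
    m = 10000
    a = rng.integers(0, 2, size=m)      # synthetic binary vectors
    b = rng.integers(0, 2, size=m)
    exact = a @ b

    for k in (100, 500, 1000, 2000, 3000):
        errs = []
        for _ in range(10):             # a few independent projections per k
            R = rng.choice([-1.0, 1.0], size=(k, m))
            est = (R @ a) @ (R @ b) / k
            errs.append(abs(est - exact) / exact)
        print(k, float(np.mean(errs)), float(np.max(errs)))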
4 Algorithm Details
4.1 Main Procedure
At the commencement of the algorithm, each site determines which local attribute offers the largest information gain. No communication is required to accomplish this. The best attribute from each site is then compared, and the attribute with the globally largest information gain, $A_G$, is selected to define the split at the root node of the tree. For each distinct value $a \in \Pi(A_G)$, a new branch leading down from the root is created. For each of these branches, the site containing $A_G$ constructs a binary vector representing which tuples correspond to this new branch, $V(A_G = a)$, and sends its projection to the other site. Upon receiving each vector, the other site indexes it according to its path and stores it in a vector cache for later use.
At each non-root node Z, each party P attempts to find the nearest ancestor of Z that splits on an attribute not local to P (one may not exist). Consider Figure 2 with P = Adam. When considering node Z1, path (1), the nearest non-local ancestor would be the grandparent of Z1. For node Z2, path (2), the nearest non-local ancestor would be the parent of Z2. For node Z3 no non-local ancestor exists.

If P fails to find a non-local ancestor for Z (i.e., the search terminated at the root), then P does not require any information from the other party to compute the information gain of its attributes. In this case the evaluation of the information gain proceeds as it does at the root and can be calculated exactly. Otherwise, P retrieves the appropriate entry from its vector cache and uses it to approximate the information gain of its local attributes using the distributed dot product. Note that, in either case, no communication is required.
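The ancestor search just described can be sketched as follows (a hypothetical Node object with parent and split_attr fields; an illustration of the lookup, not the authors' implementation).

    def nearest_nonlocal_ancestor(node, local_attrs):
        # Walk up from node toward the root and return the closest ancestor
        # whose splitting attribute is not held locally; None if all are local.
        anc = node.parent
        while anc is not None:
            if anc.split_attr not in local_attrs:
                return anc
            anc = anc.parent
        return None      # search reached the root: information gain can be computed exactly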
As before, once each site determines the local attribute with the largest information gain, the attribute with the globally largest information gain, $A_G$, is selected to define the split at Z. Following this, each party now executes one of the following actions.

If $A_G$ is local to P then, for each $a \in \Pi(A_G)$, a new branch leading down from Z is created. For each branch, P constructs a binary vector representing which tuples correspond to this new path and sends its projection to the other party.

If $A_G$ is non-local, then P waits until it receives the projection vector from the other party, indexes it according to its respective path, and stores it in the local vector cache.

The total number of messages required by the above actions is k (the number of rows of R).
In order to reduce the memory signature of the algorithm, each site will occasionally check the contents of its vector cache and delete any invalid entries. A vector becomes invalid when (1) every path associated with that vector terminates in a leaf node, or (2) the node which generated the vector is no longer the nearest non-local ancestor of any of its descendants.

We made one minor change to the algorithm presented above. When the number of ones/zeros in a binary vector is less than the number of iterations k, we can just transmit the list of indices directly, as sketched below. Not only does this reduce the communication cost of the algorithm even further, it allows the calculation of information gain further down the tree to be made in an exact, rather than approximate, manner.
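A small sketch of this optimization follows (the message encoding and names are assumptions for illustration; the sparse complement of zeros could be handled analogously).

    import numpy as np

    def encode_indicator(v, R):
        # v: 0/1 indicator vector; R: shared k x m projection matrix.
        k = R.shape[0]
        ones = np.flatnonzero(v)
        if len(ones) < k:               # sparse: the exact index list is cheaper than k messages
            return ("indices", ones.tolist())
        return ("projection", (R @ v).tolist())   # otherwise send the k projected values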
4.2 Leaf Node Determination
The construction of a path down the decision tree continues until a leaf node is reached, which meets at least one of the following criteria: (1) All of the tuples for the node belong to one class; the node is then labeled by that class. (2) If any child of a node is empty, that child is labeled as a leaf representing the most frequent class in this node. (3) There are fewer than minNumObj (4 in our experiments) tuples for the node, regardless of class; the node is then labeled by the most frequent class of all the tuples in this node. Note that the calculations used to determine this may be approximations based on the distributed dot product.
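These stopping criteria can be sketched as a single check on the (possibly approximate) per-class counts of a node; minNumObj follows the text, everything else is illustrative.

    def leaf_label(class_counts, min_num_obj=4, parent_majority=None):
        # class_counts: dict mapping class label -> (possibly approximate) tuple count.
        total = sum(class_counts.values())
        if total == 0:                       # criterion (2): empty child
            return parent_majority
        majority = max(class_counts, key=class_counts.get)
        if class_counts[majority] == total:  # criterion (1): all tuples in one class
            return majority
        if total < min_num_obj:              # criterion (3): fewer than minNumObj tuples
            return majority
        return None                          # not a leaf; keep splitting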
From the above criteria, we can see that the determination of a leaf node can actually be made by its parent, since information gain computation enables the parent to get the

References (partial list)
An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization
Popular ensemble methods: an empirical study
Building decision tree classifier on private data
An algorithmic theory of learning: robust concepts and random projection
Data Mining: Next Generation Challenges and Future Directions