Communication efficient construction of decision trees over heterogeneously distributed data
Citations
Random projection-based multiplicative data perturbation for privacy preserving distributed data mining
PLANET: massively parallel learning of tree ensembles with MapReduce
A privacy-preserving technique for Euclidean distance-based mining algorithms using Fourier-related transforms
Distributed Decision-Tree Induction in Peer-to-Peer Systems
Distributed Identification of Top-l Inner Product Elements and its Application in a Peer-to-Peer Network
References
An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization
Popular ensemble methods: an empirical study
Building decision tree classifier on private data
An algorithmic theory of learning: robust concepts and random projection
Data Mining: Next Generation Challenges and Future Directions
Frequently Asked Questions (11)
Q2. What are the future works mentioned in the paper "Communication efficient construction of decision trees over heterogeneously distributed data" ?
A primary set of directions for future work is motivated by the fact that their distributed algorithm requires more (local) computation than the centralized algorithm. One direction involves carrying out a careful timing study to compare total algorithm times (distributed vs. centralized), taking communication delays into account. Another direction involves incorporating secure multi-party computation (SMC) based protocols to address privacy constraints while retaining low communication complexity.
Q3. What is the primary goal of many methods in the literature?
Since communication is assumed to be carried out exclusively by message passing, a primary goal of many methods in the literature is to minimize the number of messages sent.
Q4. What is the main idea of DDM?
At the heart of this approach is the use of random projections to estimate the dot product between two binary vectors, combined with message optimization techniques.
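The paper gives no code for this estimator; as a rough illustration of the underlying idea, two parties sharing the same Gaussian projection matrix can estimate a dot product from short projected vectors. The names and the choice of k below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def rp_dot_estimate(a, b, k=2000):
    """Estimate <a, b> by projecting both vectors onto the same
    k shared Gaussian random directions (illustrative sketch)."""
    R = rng.standard_normal((k, len(a)))  # shared projection matrix
    # For iid N(0,1) entries, E[(Ra) . (Rb)] = k * <a, b>
    return float((R @ a) @ (R @ b)) / k

# Two binary vectors, as in the paper's setting
a = rng.integers(0, 2, size=1000)
b = rng.integers(0, 2, size=1000)

exact = int(a @ b)
approx = rp_dot_estimate(a, b)
```

The communication saving comes from exchanging the k-dimensional projections instead of the full binary vectors when k is much smaller than the data size.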
Q5. how much communication cost is required to achieve a tree?
The authors observed that, using only 20% of the communication cost necessary to centralize the data, they can achieve trees with accuracy at least 80% of that of the centralized approach (CA).
Q6. how many messages can be used in a tree?
The authors built a tree using the centralized approach (CA, with the standard Weka tree builder implementation) and others using the distributed approach (DA), while varying the number of messages used in the information gain approximation and the depth of the tree.
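The distributed algorithm approximates information gain rather than computing it exactly; as a reference point for what is being approximated, a minimal centralized information-gain computation (a standard formula, not the paper's approximation) can be sketched as:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, split):
    """Information gain of a binary split, where split[i] is True
    if record i goes to the left child."""
    left = [y for y, s in zip(labels, split) if s]
    right = [y for y, s in zip(labels, split) if not s]
    n = len(labels)
    return (entropy(labels)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

# A perfectly separating split yields a gain of 1 bit
gain = information_gain([1, 1, 0, 0], [True, True, False, False])
```

In the distributed setting, the counts feeding these entropy terms are what the dot-product estimates stand in for, with accuracy controlled by the number of messages exchanged.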
Q7. What is the main purpose of DDM?
The bulk of DDM methods in the literature operate over an abstract architecture where each site has a private memory containing its own portion of the data.
Q8. Why is the distributed setting more natural than the centralized one?
For some applications, the distributed setting is more natural than the centralized one because the data is inherently distributed.
Q9. What is the purpose of the paper?
In this paper, the authors introduce an algorithm for constructing decision trees in a distributed environment where communications resources are limited and efficient use of the available resources is needed.
Q10. What are the main topics of the paper?
These provide a broad overview of DDM touching on issues such as: association rule mining, clustering, basic statistics computation, Bayesian network learning, classification, and the historical roots of DDM.
Q11. What is the common assumption in the literature regarding heterogeneous data?
The authors assume that each site has the same number of tuples (records) and that they are ordered to facilitate matching, i.e., the ith tuple on each site corresponds to the same entity.
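This matching assumption describes vertically (heterogeneously) partitioned data; a small hypothetical illustration, with made-up attributes and values, is:

```python
# Site A and Site B each hold different attributes of the same records;
# the i-th row on each site describes the same entity (illustrative data).
site_a = [          # attributes: (age, income)
    (25, 40000),
    (37, 55000),
    (52, 72000),
]
site_b = [          # attributes: (owns_home, class_label)
    (False, 0),
    (True, 1),
    (True, 1),
]

# Because rows are aligned across sites, the full record i is just
# the concatenation of the i-th row from each site.
full = [a + b for a, b in zip(site_a, site_b)]
```

The point of the paper's setting is to learn a tree over these combined records without actually shipping the rows to one site to perform this concatenation.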