A Parallel Random Forest Algorithm for Big Data in a Spark Cloud Computing Environment

doi:10.1109/TPDS.2016.2603511

Jianguo Chen, +6 more

- 01 Apr 2017 -

IEEE Transactions on Parallel and Distri...

- Vol. 28, Iss: 4, pp 919-933

TLDR

This paper presents a Parallel Random Forest (PRF) algorithm for big data on the Apache Spark platform. The PRF algorithm is optimized with a hybrid approach that combines data-parallel and task-parallel optimization: a dual parallel approach is carried out in the training process of RF, and a task Directed Acyclic Graph (DAG) is created according to the parallel training process.

Abstract:

With the emergence of the big data age, the question of how to obtain valuable knowledge from a dataset efficiently and accurately has attracted increasing attention from both academia and industry. This paper presents a Parallel Random Forest (PRF) algorithm for big data on the Apache Spark platform. The PRF algorithm is optimized with a hybrid approach that combines data-parallel and task-parallel optimization. For data-parallel optimization, a vertical data-partitioning method reduces the data communication cost, and a data-multiplexing method allows the training dataset to be reused and diminishes the volume of data. For task-parallel optimization, a dual parallel approach is carried out in the training process of RF, and a task Directed Acyclic Graph (DAG) is created according to the parallel training process of PRF and the dependence of the Resilient Distributed Datasets (RDD) objects. Different task schedulers are then invoked for the tasks in the DAG. Moreover, to improve the algorithm's accuracy for large, high-dimensional, and noisy data, we apply a dimension-reduction approach in the training process and a weighted voting approach in the prediction process prior to parallelization. Extensive experimental results indicate that the PRF algorithm outperforms the relevant algorithms implemented by Spark MLlib and other studies in terms of classification accuracy, performance, and scalability. The advantage of the PRF algorithm becomes more pronounced as the scale of the random forest model and the Spark cluster grows.
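The abstract names vertical data partitioning and weighted voting without giving details, so the following is only a minimal single-machine sketch of the two ideas, not the paper's Spark implementation. It assumes that vertical partitioning means splitting the training set by feature column, with each slice paired with the labels, and that each tree's vote is weighted by a per-tree score such as its accuracy. The function names `vertical_partition` and `weighted_vote`, the toy data, and the weights are all hypothetical.

```python
from collections import defaultdict


def vertical_partition(rows, labels):
    """Split a row-oriented dataset into one (feature column, labels) slice per feature.

    Assumption: each slice holds a single feature's values plus the labels,
    so work on one feature does not need the other columns.
    """
    n_features = len(rows[0])
    return [([row[j] for row in rows], labels) for j in range(n_features)]


def weighted_vote(predictions, weights):
    """Return the class label with the largest total weight.

    predictions: one predicted label per tree.
    weights: one weight per tree (assumed here to be a per-tree accuracy).
    """
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# Toy training data: three samples, two features (illustrative values only).
rows = [[1.0, 5.0], [2.0, 6.0], [3.0, 7.0]]
labels = ["a", "b", "a"]
slices = vertical_partition(rows, labels)

# Three hypothetical trees: two low-weight trees predict "b", one high-weight tree predicts "a".
tree_predictions = ["a", "b", "b"]
tree_weights = [0.9, 0.4, 0.4]
winner = weighted_vote(tree_predictions, tree_weights)
```

In this toy case a plain majority vote would pick "b", while the weighted vote picks "a" because the single high-weight tree outweighs the two low-weight ones. On Spark, each slice would presumably live in its own RDD partition, but that mapping is not shown here.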
