
Showing papers on "Spark (mathematics)" published in 2021


Journal ArticleDOI
TL;DR: A semisupervised prediction model is proposed, which exploits an improved unsupervised clustering algorithm to establish the fuzzy partition function and then utilizes a neural network model to build the information prediction function.
Abstract: The continuous development of industry big data technology requires better computing methods to discover the value of data. Information forecasting, as an important part of data mining technology, has achieved excellent applications in some industries. However, the deviation and redundancy existing in the data collected by sensors make it difficult for some methods to accurately predict future information. This article proposes a semisupervised prediction model, which exploits an improved unsupervised clustering algorithm to establish the fuzzy partition function and then utilizes a neural network model to build the information prediction function. The main purpose of this article is to effectively address the time-series analysis of massive industry data. In the experimental part, we built a data platform on Spark and used some marine environmental factor datasets and UCI public datasets as analysis objects. Meanwhile, we compared the results of the proposed method with those of other traditional methods and analyzed its running performance on the Spark platform. The results show that the proposed method achieved a satisfactory prediction effect.

104 citations
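As a rough illustration of this cluster-then-predict idea on Spark, the sketch below assigns hard cluster labels with MLlib's KMeans and feeds them, together with the raw features, into a regressor. The column names and the choice of LinearRegression are hypothetical stand-ins; the paper's fuzzy partition function and neural network predictor are not part of MLlib.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("cluster-then-predict").getOrCreate()

# Toy rows standing in for sensor readings: (temp, salinity, ph, target).
rows = [(20.1, 34.9, 8.1, 1.2), (20.3, 35.0, 8.0, 1.3), (25.7, 33.1, 7.9, 2.4),
        (25.9, 33.0, 7.8, 2.5), (18.2, 36.2, 8.2, 0.9), (18.0, 36.1, 8.3, 0.8)]
df = spark.createDataFrame(rows, ["temp", "salinity", "ph", "target"])

raw_cols = ["temp", "salinity", "ph"]
assemble_raw = VectorAssembler(inputCols=raw_cols, outputCol="raw_features")
# Hard K-Means clustering as a simplified stand-in for the fuzzy partition step.
kmeans = KMeans(k=3, featuresCol="raw_features", predictionCol="cluster", seed=1)
# The cluster assignment becomes an extra input of the downstream predictor.
assemble_aug = VectorAssembler(inputCols=raw_cols + ["cluster"], outputCol="features")
regressor = LinearRegression(featuresCol="features", labelCol="target")

model = Pipeline(stages=[assemble_raw, kmeans, assemble_aug, regressor]).fit(df)
model.transform(df).select("cluster", "target", "prediction").show()
```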


Journal ArticleDOI
TL;DR: This paper successfully tackles the problem of processing a vast amount of security related data for the task of network intrusion detection by employing Apache Spark, as a big data processing tool, and proposes a hybrid scheme that combines the advantages of deep network and machine learning methods.
Abstract: This paper successfully tackles the problem of processing a vast amount of security-related data for the task of network intrusion detection. It employs Apache Spark, as a big data processing tool, for processing a large size of network traffic data. Also, we propose a hybrid scheme that combines the advantages of deep network and machine learning methods. Initially, a stacked autoencoder network is used for latent feature extraction, followed by several classification-based intrusion detection methods, such as support vector machine, random forest, decision trees, and naive Bayes, which are used for fast and efficient detection of intrusion in massive network traffic data. The real-time UNB ISCX 2012 dataset is used to validate our proposed method, and the performance is evaluated in terms of accuracy, f-measure, sensitivity, precision and time.

67 citations
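A minimal sketch of the classification stage only, assuming a labelled DataFrame `traffic_df` of numeric flow features; the stacked-autoencoder feature extraction described in the paper is omitted, and the feature column names are placeholders:

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

feature_cols = ["duration", "src_bytes", "dst_bytes", "pkt_count"]  # placeholder names
data = VectorAssembler(inputCols=feature_cols, outputCol="features").transform(traffic_df)
train, test = data.randomSplit([0.7, 0.3], seed=13)

rf = RandomForestClassifier(featuresCol="features", labelCol="label", numTrees=100)
predictions = rf.fit(train).transform(test)

evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="accuracy")
print("accuracy:", evaluator.evaluate(predictions))
```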


Journal ArticleDOI
TL;DR: In this paper, a distributed convolutional neural network (DCNN) based approach for big remote sensing image classification is proposed; it is the first study of its kind to propose a distributed deep learning-based approach for the classification of big remote sensing images.

55 citations


Journal ArticleDOI
TL;DR: The results showed that well-trained machine learning models can complement more complex physical models while also helping to optimize engine performance, emissions, and life.

53 citations


Journal ArticleDOI
TL;DR: This paper developed a deep learning algorithm that provides early warning signals (EWS) in systems it was not explicitly trained on, by exploiting information about normal forms and scaling behavior of dynamics near tipping points that are common to many dynamical systems.
Abstract: Many natural systems exhibit tipping points where slowly changing environmental conditions spark a sudden shift to a new and sometimes very different state. As the tipping point is approached, the dynamics of complex and varied systems simplify down to a limited number of possible "normal forms" that determine qualitative aspects of the new state that lies beyond the tipping point, such as whether it will oscillate or be stable. In several of those forms, indicators like increasing lag-1 autocorrelation and variance provide generic early warning signals (EWS) of the tipping point by detecting how dynamics slow down near the transition. But they do not predict the nature of the new state. Here we develop a deep learning algorithm that provides EWS in systems it was not explicitly trained on, by exploiting information about normal forms and scaling behavior of dynamics near tipping points that are common to many dynamical systems. The algorithm provides EWS in 268 empirical and model time series from ecology, thermoacoustics, climatology, and epidemiology with much greater sensitivity and specificity than generic EWS. It can also predict the normal form that characterizes the oncoming tipping point, thus providing qualitative information on certain aspects of the new state. Such approaches can help humans better prepare for, or avoid, undesirable state transitions. The algorithm also illustrates how a universe of possible models can be mined to recognize naturally occurring tipping points.

53 citations
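The generic early warning signals the abstract contrasts against, rolling variance and lag-1 autocorrelation, can be computed directly; the short NumPy sketch below illustrates those baseline indicators only, not the paper's deep learning detector:

```python
import numpy as np

def rolling_ews(x, window=100):
    """Return rolling variance and lag-1 autocorrelation of a 1-D series."""
    var, ac1 = [], []
    for i in range(window, len(x) + 1):
        w = x[i - window:i]
        var.append(np.var(w))
        ac1.append(np.corrcoef(w[:-1], w[1:])[0, 1])
    return np.array(var), np.array(ac1)

# A rising trend in both signals as the series evolves is the classic sign of
# critical slowing down near a tipping point.
series = np.cumsum(np.random.randn(1000)) * 0.01 + np.random.randn(1000)
variance, autocorr = rolling_ews(series, window=200)
```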


Journal ArticleDOI
TL;DR: This paper proposes the first compound framework for dealing with multi-class big data problems, addressing at the same time the existence of multiple classes and high volumes of data, and proposes an efficient implementation of the discussed algorithm on Apache Spark.
Abstract: Despite more than two decades of progress, learning from imbalanced data is still considered one of the contemporary challenges in machine learning. This has been further complicated by the advent of the big data era, where popular algorithms dedicated to alleviating the class skew impact are no longer feasible due to the volume of datasets. Additionally, most existing algorithms focus on binary imbalanced problems, where majority and minority classes are well-defined. Multi-class imbalanced data poses further challenges as the relationship between classes is much more complex and simple decomposition into a number of binary problems leads to a significant loss of information. In this paper, we propose the first compound framework for dealing with multi-class big data problems, addressing at the same time the existence of multiple classes and high volumes of data. We propose to analyze the instance-level difficulties in each class, leading to understanding what causes learning difficulties. We embed this information in popular resampling algorithms which allows for informative balancing of multiple classes. We propose an efficient implementation of the discussed algorithm on Apache Spark, including a novel version of SMOTE that overcomes spatial limitations in distributed environments of its predecessor. Extensive experimental study shows that using instance-level information significantly improves learning from multi-class imbalanced big data. Our framework can be downloaded from https://github.com/fsleeman/minority-type-imbalanced .

50 citations
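The framework itself (including the distributed SMOTE variant) is in the linked repository; as a much simpler illustration of balancing multiple classes on Spark, the sketch below performs naive per-class random oversampling of a DataFrame `df` with a `label` column, which is not the paper's method:

```python
from functools import reduce
from pyspark.sql import functions as F

counts = {r["label"]: r["count"] for r in df.groupBy("label").count().collect()}
majority = max(counts.values())

parts = [df]
for label, cnt in counts.items():
    if cnt < majority:
        # Draw (with replacement) roughly enough extra rows to match the majority class.
        extra_fraction = (majority - cnt) / cnt
        parts.append(df.filter(F.col("label") == label)
                       .sample(withReplacement=True, fraction=extra_fraction, seed=42))

balanced = reduce(lambda a, b: a.union(b), parts)
balanced.groupBy("label").count().show()
```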


Journal ArticleDOI
TL;DR: This study aims to help retail companies create personalized deals and promotions for their customers, even during the COVID-19 pandemic, through a big data framework that allows them to handle massive sales volumes with more efficient models.
Abstract: Retail companies recognize the need to analyze and predict their sales and customer behavior against their products and product categories. Our study aims to help retail companies create personalized deals and promotions for their customers, even during the COVID-19 pandemic, through a big data framework that allows them to handle massive sales volumes with more efficient models. In this paper, we used Black Friday sales data taken from a dataset on the Kaggle website, which contains nearly 550,000 observations analyzed with 10 qualitative and quantitative features. The class label is purchases and sales (in U.S. dollars). Because the predicted label is continuous, regression models are suited to this case. Using the Apache Spark big data framework, which provides the MLlib machine learning library, we trained two machine learning models: linear regression and random forest. These machine learning algorithms were used to predict future pricing and sales. We first implemented a linear regression model and a random forest model without using the Spark framework and achieved accuracies of 68% and 74%, respectively. Then, we trained these models on the Spark machine learning big data framework, where we achieved an accuracy of 72% for the linear regression model and 81% for the random forest model.

45 citations
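A condensed sketch of the Spark MLlib side of such a study, assuming the Kaggle Black Friday data has already been loaded, encoded numerically, and named `sales_df` with the usual `Purchase` target column; the feature list is illustrative:

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression, RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator

feature_cols = ["Gender", "Age", "Occupation", "City_Category", "Product_Category_1"]  # already encoded
data = VectorAssembler(inputCols=feature_cols, outputCol="features") \
    .transform(sales_df).select("features", "Purchase")
train, test = data.randomSplit([0.8, 0.2], seed=3)

evaluator = RegressionEvaluator(labelCol="Purchase", predictionCol="prediction", metricName="r2")
for estimator in (LinearRegression(labelCol="Purchase"),
                  RandomForestRegressor(labelCol="Purchase", numTrees=100)):
    model = estimator.fit(train)
    print(type(estimator).__name__, "R2:", evaluator.evaluate(model.transform(test)))
```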


Journal ArticleDOI
TL;DR: The K-nearest neighbor (KNN) algorithm is implemented on datasets of different sizes within both the Hadoop and Spark frameworks, and the results show that the runtime of the KNN algorithm implemented on Spark is 4 to 4.5 times faster than on Hadoop.
Abstract: One of the most challenging issues in the big data research area is the inability to process a large volume of information in a reasonable time. Hadoop and Spark are two frameworks for distributed data processing. Hadoop is a very popular and general platform for big data processing. Because of its in-memory programming model, Spark as an open-source framework is suitable for processing iterative algorithms. In this paper, the Hadoop and Spark frameworks, the big data processing platforms, are evaluated and compared in terms of runtime, memory and network usage, and central processor efficiency. Hence, the K-nearest neighbor (KNN) algorithm is implemented on datasets of different sizes within both the Hadoop and Spark frameworks. The results show that the runtime of the KNN algorithm implemented on Spark is 4 to 4.5 times faster than on Hadoop. Evaluations show that Hadoop uses more resources, including the central processor and the network. It is concluded that CPU utilization in Spark is more efficient than in Hadoop. On the other hand, memory usage in Hadoop is less than in Spark.

43 citations
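A minimal brute-force sketch of how a single KNN query can be expressed on Spark DataFrames (distances computed in parallel across partitions, then the k smallest taken); the feature columns and query point are made up for illustration:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("knn-sketch").getOrCreate()

df = spark.createDataFrame(
    [(1, 0.2, 1.1), (2, 0.4, 0.9), (3, 5.0, 4.8), (4, 5.2, 5.1), (5, 0.1, 1.0)],
    ["id", "f1", "f2"])

query = {"f1": 0.3, "f2": 1.0}   # the point whose neighbours we want
k = 3

# Squared Euclidean distance, built column by column and evaluated in parallel.
sq_dist = sum(F.pow(F.col(c) - v, 2) for c, v in query.items())
neighbours = df.withColumn("sq_dist", sq_dist).orderBy("sq_dist").limit(k)
neighbours.show()
```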


Journal ArticleDOI
TL;DR: An efficient rolling bearing fault diagnosis method based on Spark and an improved random forest (IRF) algorithm is proposed; by eliminating the decision trees with low classification accuracy and those prone to repeated voting in the original RF, an improved RF with faster diagnosis speed and higher classification accuracy is constructed.
Abstract: The random forest (RF) algorithm is a typical representative of ensemble learning, which is widely used in rolling bearing fault diagnosis. In order to solve the problems of slower diagnosis speed and repeated voting of traditional RF algorithm in rolling bearing fault diagnosis under the big data environment, an efficient rolling bearing fault diagnosis method based on Spark and improved random forest (IRF) algorithm is proposed. By eliminating the decision trees with low classification accuracy and those prone to repeated voting in the original RF, an improved RF with faster diagnosis speed and higher classification accuracy is constructed. For the massive rolling bearing vibration data, in order to improve the training speed and diagnosis speed of the rolling bearing fault diagnosis model, the IRF algorithm is parallelized on the Spark platform. First, an original RF model is obtained by training multiple decision trees in parallel. Second, the decision trees with low classification accuracy in the original RF model are filtered. Third, all path information of the reserved decision trees is obtained in parallel. Fourth, a decision tree similarity matrix is constructed in parallel to eliminate the decision trees which are prone to repeated voting. Finally, an IRF model which can diagnose rolling bearing faults quickly and effectively is obtained. A series of experiments are carried out to evaluate the effectiveness of the proposed rolling bearing fault diagnosis method based on Spark and IRF algorithm. The results show that the proposed method can not only achieve good fault diagnosis accuracy, but also have fast model training speed and fault diagnosis speed for large-scale rolling bearing datasets.

39 citations
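A simplified sketch of the tree-filtering idea only (not the paper's full IRF, which also removes trees prone to repeated voting via a similarity matrix), assuming a labelled DataFrame `data` and a recent PySpark version in which the fitted forest exposes its member trees through `model.trees`:

```python
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

train, val = data.randomSplit([0.8, 0.2], seed=7)
forest = RandomForestClassifier(featuresCol="features", labelCol="label",
                                numTrees=50).fit(train)

evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="accuracy")
# Score each member decision tree separately and keep only the stronger ones.
scored = [(tree, evaluator.evaluate(tree.transform(val))) for tree in forest.trees]
kept = [tree for tree, acc in scored if acc >= 0.80]   # illustrative threshold
print(f"kept {len(kept)} of {len(forest.trees)} trees")
# MLlib does not rebuild a forest from a subset of trees, so the final vote
# over `kept` would have to be aggregated manually.
```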


Journal ArticleDOI
TL;DR: A four-stage MapReduce framework, based solely on the well-known Spark platform, is presented for use in high-utility sequential pattern mining; it is shown to deliver more efficient and faster mining performance when dealing with large data sets.
Abstract: The concepts of sequential pattern mining have become a growing topic in data mining, finding a home most recently in the Internet of Things (IoT), where large volumes of data are presented by the second for analysis and knowledge extraction. One key topic within the realm of sequential pattern mining is high-utility sequential pattern mining (HUSPM). HUSPM takes into account the fusion of utility and sequence factors to assist in the determination of sequential patterns of high utility from databases and data sources. That being said, almost all current existing literature focuses on using only a single machine to increase mining performance. In this work, we present a four-stage MapReduce framework that is based solely on the well-known Spark platform for use in HUSPM. This framework is shown to deliver more efficient and faster mining performance when dealing with large data sets. It consists of four phases, namely initialization, mining, updating, and generation, to handle big data sets based on the MapReduce framework running on the Spark platform. Experiments indicated that the designed model is capable of handling very big data sets, while state-of-the-art algorithms can only achieve good performance on small data sets.

37 citations
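Utility-aware mining is not part of MLlib, but plain sequential pattern mining is; the sketch below shows MLlib's PrefixSpan on toy sequences, as a utility-agnostic baseline rather than the HUSPM framework described above:

```python
from pyspark.sql import SparkSession, Row
from pyspark.ml.fpm import PrefixSpan

spark = SparkSession.builder.appName("prefixspan-sketch").getOrCreate()

# Each row is a sequence of itemsets.
sequences = spark.createDataFrame([
    Row(sequence=[[1, 2], [3]]),
    Row(sequence=[[1], [3, 2], [1, 2]]),
    Row(sequence=[[1, 2], [5]]),
    Row(sequence=[[6]]),
])

prefix_span = PrefixSpan(minSupport=0.5, maxPatternLength=5, sequenceCol="sequence")
prefix_span.findFrequentSequentialPatterns(sequences).show()
```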


Journal ArticleDOI
01 May 2021
TL;DR: Spark NLP is used by 54% of healthcare organizations as the world’s most widely used NLP library in the enterprise, and has experienced 9x growth since January 2020.
Abstract: Spark NLP is a Natural Language Processing (NLP) library built on top of Apache Spark ML. It provides simple, performant & accurate NLP annotations for machine learning pipelines that can scale easily in a distributed environment. Spark NLP comes with 1100+ pretrained pipelines and models in more than 192 languages. It supports nearly all the NLP tasks and modules that can be used seamlessly in a cluster. Downloaded more than 2.7 million times and experiencing 9x growth since January 2020, Spark NLP is used by 54% of healthcare organizations as the world’s most widely used NLP library in the enterprise.
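A minimal usage sketch of the pretrained-pipeline API described above; the pipeline name is one of the commonly documented ones, model files are downloaded on first use, and the exact output keys depend on the chosen pipeline:

```python
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

spark = sparknlp.start()  # starts a Spark session with Spark NLP on the classpath

pipeline = PretrainedPipeline("explain_document_dl", lang="en")
result = pipeline.annotate("Spark NLP ships with over 1100 pretrained pipelines and models.")

# Keys vary by pipeline (tokens, lemmas, POS tags, entities, ...).
print(result.keys())
```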

Journal ArticleDOI
TL;DR: This paper addresses the real-time prediction of application layer DDoS attacks with different machine learning models, applying the scikit-learn library and the big data framework Spark ML for the detection of Denial of Service (DoS) attacks, and optimizes the performance of the models by minimizing the prediction time.
Abstract: Currently, the Distributed Denial of Service (DDoS) attack has become rampant, and shows up in various shapes and patterns, therefore it is not easy to detect and solve with previous solutions. Classification algorithms have been used in many studies aiming to detect and solve the DDoS attack. DDoS attacks are performed easily by using the weaknesses of networks and by generating requests for services for software. Real-time detection of DDoS attacks is difficult to achieve, but this solution holds significant value as these attacks can cause big issues. This paper addresses the prediction of application layer DDoS attacks in real time with different machine learning models. We applied two machine learning approaches, Random Forest (RF) and Multi-Layer Perceptron (MLP), through the Scikit ML library and the big data framework Spark ML library for the detection of Denial of Service (DoS) attacks. In addition to the detection of DoS attacks, we optimized the performance of the models by minimizing the prediction time, compared with other existing approaches, using the big data framework (Spark ML). We achieved a mean accuracy of 99.5% with the models both with and without the big data approach. However, in training and testing time, the big data approach outperforms the non-big data approach because Spark performs its in-memory computations in a distributed manner. The minimum average training and testing times, 14.08 and 0.04 minutes respectively, were achieved using the big data tool (Apache Spark), while the maximum training and testing times, 34.11 and 0.46 minutes respectively, were obtained with the non-big data approach. We can detect an attack in real time within a few milliseconds.

Journal ArticleDOI
TL;DR: A survey of hardware accelerators and hardware-aware algorithmic optimizations for 3D CNNs is presented and it is believed that this survey will spark a great deal of research towards the design of ultra-efficient 3DCNN accelerators of tomorrow.

Journal ArticleDOI
TL;DR: This paper addresses the problem of efficiently storing and querying spatio-temporal RDF data in parallel, by proposing the DiStRDF system, which is comprised of a Storage and a Processing Layer and uses Spark, a well-known distributed in-memory processing framework, as the underlying processing engine.
Abstract: The ever-increasing size of data emanating from mobile devices and sensors, dictates the use of distributed systems for storing and querying these data. Typically, such data sources provide some spatio-temporal information, alongside other useful data. The RDF data model can be used to interlink and exchange data originating from heterogeneous sources in a uniform manner. For example, consider the case where vessels report their spatio-temporal position, on a regular basis, by using various surveillance systems. In this scenario, a user might be interested to know which vessels were moving in a specific area for a given temporal range. In this paper, we address the problem of efficiently storing and querying spatio-temporal RDF data in parallel. We specifically study the case of SPARQL queries with spatio-temporal constraints, by proposing the DiStRDF system, which is comprised of a Storage and a Processing Layer. The DiStRDF Storage Layer is responsible for efficiently storing large amount of historical spatio-temporal RDF data of moving objects. On top of it, we devise our DiStRDF Processing Layer, which parses a SPARQL query and produces corresponding logical and physical execution plans. We use Spark, a well-known distributed in-memory processing framework, as the underlying processing engine. Our experimental evaluation, on real data from both aviation and maritime domains, demonstrates the efficiency of our DiStRDF system, when using various spatio-temporal range constraints.
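The SPARQL-with-constraints machinery is specific to DiStRDF, but the core spatio-temporal range predicate it evaluates is easy to picture on a plain Spark DataFrame; the sketch below is that simplified relational analogue, with made-up column names, path, and bounds:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("spatio-temporal-range").getOrCreate()

# Assumed schema of position reports: vessel_id, ts (timestamp), lon, lat.
positions = spark.read.parquet("/data/positions")   # illustrative path

in_area_and_time = positions.filter(
    F.col("lon").between(23.0, 24.5) &
    F.col("lat").between(37.5, 38.2) &
    F.col("ts").between("2021-01-01 00:00:00", "2021-01-02 00:00:00"))

in_area_and_time.select("vessel_id").distinct().show()
```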

Journal ArticleDOI
TL;DR: A novel approach, namely the Spark-based Apriori algorithm with reduced shuffle overhead (SARSO), is proposed, which utilizes the benefits of Spark’s parallel and distributed computing environment and its in-memory processing capabilities.
Abstract: Mining frequent itemsets is considered a core activity for finding association rules from transactional datasets. Among the different well-known approaches to find frequent itemsets, the Apriori algorithm is the earliest proposed. Many attempts have been made to adapt the Apriori algorithm for large-scale datasets. But the bottlenecks associated with Apriori, such as repeated scans of the input dataset and generation of all the candidate itemsets prior to counting their support values, reduce the effectiveness of Apriori for large-size datasets. When the data size is large, even distributed and parallel implementations of Apriori using the MapReduce framework do not perform well. This is due to the iterative nature of the algorithm, which incurs high disk overhead. In each iteration, the input dataset residing on disk is scanned, causing high disk I/O. Apache Spark implementations of Apriori show better performance due to in-memory processing capabilities. Spark makes iterative scanning of datasets faster by keeping them in a memory abstraction called a resilient distributed dataset (RDD). An RDD keeps datasets in the form of key-value pairs spread across the cluster nodes. RDD operations require these key-value pairs to be redistributed among cluster nodes in the course of processing. This redistribution, or shuffle operation, incurs communication and synchronization overhead. In this manuscript, we propose a novel approach, namely the Spark-based Apriori algorithm with reduced shuffle overhead (SARSO). It utilizes the benefits of Spark’s parallel and distributed computing environment and its in-memory processing capabilities. It further improves efficiency by reducing the shuffle overhead caused by RDD operations at each iteration. In other words, it restricts the movement of key-value pairs across the cluster nodes by using a partitioning method and hence reduces the communication and synchronization overhead incurred by the Spark shuffle operation. Extensive experiments have been conducted to measure the performance of SARSO on benchmark datasets and compare it with an existing algorithm. Experimental results show that SARSO has better performance in terms of running time and scalability.
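The general idea of cutting shuffle cost by fixing the partitioning of key-value pairs up front can be illustrated with stock RDD operations; this is only a generic sketch of that mechanism (pre-partitioning plus a matching reduceByKey avoids a second redistribution), not the SARSO algorithm itself, and the input path and format are illustrative:

```python
from pyspark import SparkContext

sc = SparkContext(appName="partitioned-counting-sketch")

# One transaction per line, items separated by spaces.
pairs = (sc.textFile("hdfs:///data/transactions.txt")
           .flatMap(lambda line: [(item, 1) for item in line.split()]))

# Hash-partition the pairs once and cache them; a later reduceByKey that uses
# the same number of partitions (and the default hash partitioner) can then
# combine values locally without moving records across the cluster again.
partitioned = pairs.partitionBy(64).cache()
support_counts = partitioned.reduceByKey(lambda a, b: a + b, 64)

print(support_counts.take(5))
```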

Journal ArticleDOI
TL;DR: This article proposes a fast Content-Based Image Retrieval system using Spark (CBIR-S) targeting large-scale images, and shows the effectiveness of the approach in terms of processing time.

Journal ArticleDOI
15 Jul 2021-Energy
TL;DR: In this paper, a customized liner with four side spark plugs was used to trigger controllable knock, through various spark strategies (e.g., spark number, timing, and location).

Journal ArticleDOI
TL;DR: An efficient classification and reduction technique for big data based on the parallel generalized Hebbian algorithm (GHA), one of the commonly used principal component analysis (PCA) neural network (NN) learning algorithms, is presented.
Abstract: Advancements in information technology are contributing to the excessive rate of big data generation recently. Big data refers to datasets that are huge in volume and consume much time and space to process and transmit using the available resources. Big data also covers data with unstructured and structured formats. Many agencies are currently subscribing to research on big data analytics owing to the failure of existing data processing techniques to handle the rate at which big data is generated. This paper presents an efficient classification and reduction technique for big data based on the parallel generalized Hebbian algorithm (GHA), which is one of the commonly used principal component analysis (PCA) neural network (NN) learning algorithms. The new method proposed in this study was compared to existing methods to demonstrate its capabilities in reducing the dimensionality of big data. The proposed method is implemented using the Spark Radoop platform.
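MLlib does not ship the generalized Hebbian algorithm, but the dimensionality-reduction step it approximates can be sketched with MLlib's exact PCA, assuming a DataFrame `df` of numeric columns listed in `feature_cols`:

```python
from pyspark.ml.feature import PCA, VectorAssembler

assembled = VectorAssembler(inputCols=feature_cols, outputCol="features").transform(df)

pca = PCA(k=10, inputCol="features", outputCol="reduced")   # keep 10 components
model = pca.fit(assembled)

print(model.explainedVariance)          # variance captured by each component
reduced = model.transform(assembled).select("reduced")
```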

Journal ArticleDOI
TL;DR: The Mondrian multidimensional anonymization method was developed and improved to satisfy the l-diversity privacy model, and it is presented in a distributed fashion within the Apache Spark framework to resolve the speed problem in large-scale data anonymization that exists in previous Hadoop-based algorithms.

DOI
13 Sep 2021
TL;DR: In this paper, the authors studied the effect of big data processing on NLP tasks based on a deep learning approach and compared the performance of BERT with the pipelines from Spark NLP.
Abstract: The rise of big data analytics on top of NLP is increasing the computational burden for text processing at scale. The problems faced in NLP involve very high dimensional text, so they require substantial computational resources. MapReduce allows parallelization of large computations and can improve the efficiency of text processing. This research aims to study the effect of big data processing on NLP tasks based on a deep learning approach. We classify a big text of news topics by fine-tuning BERT with pretrained models. Five pretrained models with different numbers of parameters were used in this study. To measure the efficiency of this method, we compared the performance of BERT with the pipelines from Spark NLP. The result shows that BERT without Spark NLP gives higher accuracy compared to BERT with Spark NLP. The average accuracy and training time of all models using BERT are 0.9187 and 35 minutes, while using BERT with the Spark NLP pipeline they are 0.8444 and 9 minutes. A bigger model takes more computational resources and needs a longer time to complete the task. However, the accuracy of BERT with Spark NLP decreased by only an average of 5.7%, while the training time was reduced significantly, by 62.9%, compared to BERT without Spark NLP.

Journal ArticleDOI
TL;DR: In this article, a kernel-based fuzzy clustering approach is proposed to deal with non-linearly separable problems by applying kernel radial basis functions (RBF), which map the input data space non-linearly into a high-dimensional feature space.

Journal ArticleDOI
TL;DR: This paper proposes an architecture for data storage and analytics in a big data lake of electricity usage using Spark, integrating the data following the data lake principle with Hive and HBase and their respective search engines.
Abstract: Smart meters generate a large number of electricity usage records day by day. The traditional architecture might not properly handle such increasingly dynamic data, which needs flexibility. For effective storage and analytics, an efficient architecture is needed to accommodate much greater data volumes and varieties. In this paper, we propose an architecture for data storage and analytics in a big data lake of electricity usage using Spark. Apache Sqoop was used to migrate historical data from an existing system to Apache Hive for processing. Apache Kafka was used as the input source for Spark to stream data to Apache HBase, to ensure the integrity of the streaming data. In order to integrate the data following the data lake principle, Apache Impala and Apache Phoenix are used as search engines for Hive and HBase, respectively. This work also analyzes electricity usage and power failures with Apache Spark. All of the visualizations of this project are presented in Apache Superset. Moreover, a usage prediction comparison is presented using the Holt-Winters algorithm.
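The Kafka-to-Spark leg of such a pipeline can be sketched with Structured Streaming; the broker address, topic, and sink paths below are placeholders, the HBase sink used in the paper is replaced by a Parquet sink for brevity, and the kafka source requires the spark-sql-kafka package on the classpath:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("meter-stream").getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
       .option("subscribe", "meter-readings")              # placeholder topic
       .load())

readings = raw.selectExpr("CAST(value AS STRING) AS reading",
                          "timestamp AS ingest_ts")

query = (readings.writeStream
         .format("parquet")
         .option("path", "/data/lake/meter")
         .option("checkpointLocation", "/data/lake/_checkpoints/meter")
         .outputMode("append")
         .start())

query.awaitTermination()
```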

Journal ArticleDOI
TL;DR: In this paper, a fault diagnosis method of rolling bearing using Spark-based parallel ant colony optimization (ACO)-K-means clustering algorithm is proposed, which can not only achieve good fault diagnosis accuracy but also provide high model training efficiency and fault diagnosis efficiency in a big data environment.
Abstract: K-Means clustering algorithm is a typical unsupervised learning method, and it has been widely used in the field of fault diagnosis. However, the traditional serial K-Means clustering algorithm is difficult to efficiently and accurately perform clustering analysis on the massive running-state monitoring data of rolling bearing. Therefore, a novel fault diagnosis method of rolling bearing using Spark-based parallel ant colony optimization (ACO)-K-Means clustering algorithm is proposed. Firstly, a Spark-based three-layer wavelet packet decomposition approach is developed to efficiently preprocess the running-state monitoring data to obtain eigenvectors, which are stored in Hadoop Distributed File System (HDFS) and served as the input of ACO-K-Means clustering algorithm. Secondly, ACO-K-Means clustering algorithm suitable for rolling bearing fault diagnosis is proposed to improve the diagnosis accuracy. ACO algorithm is adopted to obtain the global optimal initial clustering centers of K-Means from all eigenvectors, and the K-Means clustering algorithm based on weighted Euclidean distance is used to perform clustering analysis on all eigenvectors to obtain a rolling bearing fault diagnosis model. Thirdly, the efficient parallelization of ACO-K-Means clustering algorithm is implemented on a Spark platform, which can make full use of the computing resources of a cluster to efficiently process large-scale rolling bearing datasets in parallel. Extensive experiments are conducted to verify the effectiveness of the proposed fault diagnosis method. Experimental results show that the proposed method can not only achieve good fault diagnosis accuracy but also provide high model training efficiency and fault diagnosis efficiency in a big data environment.

Journal ArticleDOI
01 Aug 2021-Fuel
TL;DR: In this paper, a specialized liner with four side spark plugs mounted on the cylinder head is used to produce various in-cylinder flame propagation patterns, and various spark strategies (e.g., spark timing, spark number, spark location) are applied to generate different auto-ignition sites and knock characteristics.

Journal ArticleDOI
TL;DR: This paper proposes a new approach based on Adaboost, which can efficiently and accurately predict the performance of a given application with a given Spark configuration, and uses the classic projective sampling technique to minimize the overhead of the modeling.

Book ChapterDOI
10 Jan 2021
TL;DR: This paper implemented a Bi-LSTM-CNN-Char deep learning architecture on top of Apache Spark and achieved state-of-the-art results on seven public biomedical benchmarks without using heavy contextual embeddings.
Abstract: Named entity recognition (NER) is a widely applicable natural language processing task and building block of question answering, topic modeling, information retrieval, etc. In the medical domain, NER plays a crucial role by extracting meaningful chunks from clinical notes and reports, which are then fed to downstream tasks like assertion status detection, entity resolution, relation extraction, and de-identification. Reimplementing a Bi-LSTM-CNN-Char deep learning architecture on top of Apache Spark, we present a single trainable NER model that obtains new state-of-the-art results on seven public biomedical benchmarks without using heavy contextual embeddings like BERT. This includes improving BC4CHEMD to 93.72% (4.1% gain), Species800 to 80.91% (4.6% gain), and JNLPBA to 81.29% (5.2% gain). In addition, this model is freely available within a production-grade code base as part of the open-source Spark NLP library; can scale up for training and inference in any Spark cluster; has GPU support and libraries for popular programming languages such as Python, R, Scala and Java; and can be extended to support other human languages with no code changes.

Journal ArticleDOI
TL;DR: This work is based on two key enablers: containers, to isolate Spark's parallel executors and allow for the dynamic and fast allocation of resources, and control-theory to govern resource allocation at runtime and obtain required precision and speed.
Abstract: Many big-data applications are batch applications that exploit dedicated frameworks to perform massively parallel computations across clusters of machines. The time needed to process the entirety of the inputs represents the application's response time, which can be subject to deadlines. Spark, probably the most famous incarnation of these frameworks today, allocates resources to applications statically at the beginning of the execution and deviations are not managed: to meet the applications’ deadlines, resources must be allocated carefully. This paper proposes an extension to Spark, called dynaSpark, that is able to allocate and redistribute resources to applications dynamically to meet deadlines and cope with the execution of unanticipated applications. This work is based on two key enablers: containers, to isolate Spark's parallel executors and allow for the dynamic and fast allocation of resources, and control-theory to govern resource allocation at runtime and obtain required precision and speed. Our evaluation shows that dynaSpark can (i) allocate resources efficiently to execute single applications with respect to set deadlines and (ii) reduce deadline violations (w.r.t. Spark) when executing multiple concurrent applications.
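dynaSpark itself extends Spark with containers and control theory; for comparison, the closest stock mechanism is Spark's built-in dynamic allocation, which the sketch below enables through ordinary configuration (the executor bounds are illustrative):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("elastic-batch")
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.dynamicAllocation.minExecutors", "2")
         .config("spark.dynamicAllocation.maxExecutors", "50")
         # Shuffle tracking (Spark 3.0+) avoids needing an external shuffle service.
         .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
         .getOrCreate())
```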

Journal ArticleDOI
TL;DR: The results showed that Spark performed better than Hadoop in the parallel implementation of the recommendation algorithm on heterogeneous Spark clusters, and that the HSATS adaptive task scheduling strategy reduced job completion time and made the utilization of cluster node resources more reasonable.

Journal ArticleDOI
TL;DR: An RDD-based implementation of a subtree data anonymization technique for Apache Spark is proposed to address the issues associated with MapReduce-based counterparts; results show high performance compared to existing state-of-the-art privacy-preserving approaches.
Abstract: Data anonymization strategies such as subtree generalization have been hailed as techniques that provide a more efficient generalization strategy compared to full-tree generalization counterparts. Many subtree-based generalization strategies (e.g., top-down, bottom-up, and hybrid) have been implemented on the MapReduce platform to take advantage of scalability and parallelism. However, MapReduce inherently lacks support for iteration-intensive algorithms such as subtree generalization. This paper proposes a Resilient Distributed Dataset (RDD)-based implementation of a subtree-based data anonymization technique for Apache Spark to address the issues associated with MapReduce-based counterparts. We describe our RDD-based approach, which offers effective partition management, improved memory usage through caching of frequently referenced intermediate values, and enhanced iteration support. Our experimental results show high performance compared to existing state-of-the-art privacy-preserving approaches, while ensuring the data utility and privacy levels required of any competitive data anonymization technique.
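A toy sketch of the ingredients this paragraph mentions, caching and iteration, applied to a generic generalization loop that progressively masks a quasi-identifier until every group reaches size k; the record layout, column index, and masking rule are made up, and this is not the paper's subtree algorithm:

```python
from pyspark import SparkContext

sc = SparkContext(appName="anon-sketch")

# Assumed CSV layout: name,age,zipcode,diagnosis (illustrative path and schema).
records = sc.textFile("hdfs:///data/records.csv").map(lambda l: l.split(",")).cache()

K = 10     # minimum group size we want per generalized zipcode
ZIP = 2    # index of the quasi-identifier column in each record

def generalize(zipcode, level):
    """Mask the last `level` digits of a zipcode, e.g. 10115 -> 101**."""
    return zipcode if level == 0 else zipcode[:-level] + "*" * level

chosen_level = 5   # fall back to full masking if no smaller level suffices
for level in range(0, 6):
    # Each pass re-reads the cached RDD instead of rescanning the input files.
    group_sizes = records.map(lambda r, lv=level: generalize(r[ZIP], lv)).countByValue()
    if min(group_sizes.values()) >= K:
        chosen_level = level
        break

anonymized = records.map(lambda r, lv=chosen_level:
                         r[:ZIP] + [generalize(r[ZIP], lv)] + r[ZIP + 1:])
```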