
Showing papers on "Spark (mathematics)" published in 2020


Book ChapterDOI
23 Aug 2020
TL;DR: In this article, spatial-temporal sparse incremental perturbations are generated online to make adversarial attacks on visual object tracking less perceptible, a setting rarely explored by previous work on classification attacks.
Abstract: Adversarial attacks of deep neural networks have been intensively studied on image, audio, and natural language classification tasks. Nevertheless, as a typical yet important real-world application, the adversarial attacks on online video tracking, which traces an object’s moving trajectory instead of its category, are rarely explored. In this paper, we identify a new task for adversarial attacks on visual tracking: online generation of imperceptible perturbations that mislead trackers along an incorrect (Untargeted Attack, UA) or a specified (Targeted Attack, TA) trajectory. To this end, we first propose a spatial-aware basic attack by adapting existing attack methods, i.e., FGSM, BIM, and C&W, and comprehensively analyze the attacking performance. We identify that online object tracking poses two new challenges: 1) it is difficult to generate imperceptible perturbations that can transfer across frames, and 2) real-time trackers require the attack to satisfy a certain level of efficiency. To address these challenges, we further propose the spatial-aware online incremental attack (a.k.a. SPARK) that performs spatial-temporal sparse incremental perturbations online and makes the adversarial attack less perceptible. In addition, as an optimization-based method, SPARK quickly converges to very small losses within several iterations by considering historical incremental perturbations, making it much more efficient than basic attacks. The in-depth evaluation of the state-of-the-art trackers (i.e., SiamRPN++ with AlexNet, MobileNetv2, and ResNet-50, and SiamDW) on OTB100, VOT2018, UAV123, and LaSOT demonstrates the effectiveness and transferability of SPARK in misleading the trackers under both UA and TA with minor perturbations.
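
For intuition, here is a minimal numpy sketch of an incremental, BIM-style signed-gradient update carried across frames, in the spirit of SPARK; tracker_loss_grad is a hypothetical placeholder for backpropagation through a real tracker such as SiamRPN++, and this is not the authors' implementation.

import numpy as np

def tracker_loss_grad(frame, perturbation, target_bbox):
    # Hypothetical placeholder: gradient of the tracker's loss w.r.t. the
    # perturbed frame, obtained in practice by backprop through the tracker.
    raise NotImplementedError

def incremental_attack(frames, target_bboxes, eps=0.3, bound=10.0, iters=10):
    # The perturbation is carried over from frame to frame (the "incremental"
    # part), so each new frame needs only a few refinement steps.
    pert = np.zeros_like(frames[0], dtype=np.float32)
    adv_frames = []
    for frame, bbox in zip(frames, target_bboxes):
        for _ in range(iters):
            g = tracker_loss_grad(frame, pert, bbox)
            pert = pert + eps * np.sign(g)       # BIM-style signed step
            pert = np.clip(pert, -bound, bound)  # L_inf budget keeps it subtle
        adv_frames.append(np.clip(frame + pert, 0, 255))
    return adv_frames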

56 citations


Journal ArticleDOI
TL;DR: This work investigates existing approaches to parameter tuning for both batch and stream data processing systems and classifies them into six categories: rule-based, cost modeling, simulation-based, experiment-driven, machine learning, and adaptive tuning.
Abstract: Big data processing systems (e.g., Hadoop, Spark, Storm) contain a vast number of configuration parameters controlling parallelism, I/O behavior, memory settings, and compression. Improper parameter settings can cause significant performance degradation and stability issues. However, regular users and even expert administrators grapple with understanding and tuning them to achieve good performance. We investigate existing approaches to parameter tuning for both batch and stream data processing systems and classify them into six categories: rule-based, cost modeling, simulation-based, experiment-driven, machine learning, and adaptive tuning. We summarize the pros and cons of each approach and raise some open research problems for automatic parameter tuning.
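
For context, a minimal PySpark sketch of the kind of configuration knobs these tuners search over; the specific values are illustrative, not recommendations from the survey.

from pyspark.sql import SparkSession

# A handful of the many parameters the surveyed tuners adjust.
spark = (SparkSession.builder
         .appName("tuning-demo")
         .config("spark.executor.memory", "4g")         # memory settings
         .config("spark.executor.cores", "2")           # parallelism
         .config("spark.sql.shuffle.partitions", "64")  # I/O and shuffle behavior
         .config("spark.io.compression.codec", "lz4")   # compression
         .getOrCreate())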

55 citations


Proceedings ArticleDOI
20 Apr 2020
TL;DR: This paper presents JUST, a spatio-temporal data engine whose operations are all performed through a SQL-like query language, JustQL; JUST manages big spatio-temporal data efficiently and conveniently, offers competitive query performance, and is much more scalable than other distributed data management systems.
Abstract: With the prevalence of positioning techniques, a prodigious amount of spatio-temporal data is generated constantly. To effectively support sophisticated urban applications based on spatio-temporal data, e.g., location-based services, an efficient, scalable, update-enabled, and easy-to-use spatio-temporal data management system is desirable. This paper presents JUST, i.e., the JD Urban Spatio-Temporal data engine, which can efficiently manage big spatio-temporal data in a convenient way. JUST incorporates the distributed NoSQL data store Apache HBase as the underlying storage, GeoMesa as the spatio-temporal data indexing tool, and Apache Spark as the execution engine. We creatively design two indexing techniques, i.e., Z2T and XZ2T, which accelerate spatio-temporal queries tremendously. Furthermore, we introduce a compression mechanism, which not only greatly reduces the storage cost, but also improves the query efficiency. To make JUST easy to use, we design and implement a complete SQL engine, with which all operations can be performed through a SQL-like query language, i.e., JustQL. JUST also inherently supports new data insertions and historical data updates without index reconstruction. JUST is deployed as a PaaS in JD with multi-user support. Many applications have been developed based on the SDKs provided by JUST. Extensive experiments are carried out against six state-of-the-art distributed spatio-temporal data management systems based on two real datasets and one synthetic dataset. The results show that JUST has competitive query performance and is much more scalable than the other systems.
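
Z2-style indexes are built on Z-order (Morton) space-filling curves; the following pure-Python sketch shows 2-D Morton encoding, the basic building block, while the exact Z2T/XZ2T layouts are specific to JUST and not reproduced here.

def morton2d(x, y, bits=16):
    # Interleave the bits of grid coordinates x and y into one Z-order key,
    # so points close in 2-D space tend to be close in key order.
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key

# Cells are then scanned in Z-order; a time prefix (as in Z2T) would be
# prepended to the key so data is partitioned by time first.
print(morton2d(3, 5))  # -> 39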

50 citations


Journal ArticleDOI
TL;DR: The experimental results indicate that the proposed distributed computing framework on Spark can perform accurate multi-step forecasting of wind speed big data and computes faster than the stand-alone method when processing big data.

45 citations


Journal ArticleDOI
TL;DR: The plan is to improve computational engine efficiency and rain prediction models effectively using big data and Hadoop learning; the high timeliness and accuracy of the planned real-time hurricane-and-rain forecasting can solve this problem.

44 citations


Journal ArticleDOI
TL;DR: A detailed model-based evaluation shows that SmartSSD has the potential for a transformative impact when building a high-performance data analytics system, enabling a 3.04x performance improvement while consuming only 45.8 percent of the energy of the conventional CPU-based approach.
Abstract: Faced with the increasing disparity between SSD throughput and CPU-based compute capabilities, there has been growing interest in moving compute closer to storage to accelerate data analytics workloads. In this letter, we propose SmartSSD, an SSD with an onboard FPGA, which enables offloading computation into the SSD. We perform a detailed model-based evaluation of the end-to-end performance and energy benefit of SmartSSD for representative data analytics workloads with Spark SQL and the Parquet columnar data format. Our evaluation shows that SmartSSD has the potential for a transformative impact when building a high-performance data analytics system, enabling a 3.04x performance improvement while consuming only 45.8 percent of the energy of the conventional CPU-based approach.

41 citations


Journal ArticleDOI
TL;DR: A hybrid approach for the detection of SYN-DOS cyber-attacks on IoT devices is proposed: the application of an explicit Random Forest model, implemented directly on the IoT device, along with a second level analysis performed in the Cloud.
Abstract: In the fields of Internet of Things (IoT) infrastructures, attack and anomaly detection are rising concerns. With the increased use of IoT infrastructure in every domain, threats and attacks on these infrastructures are also growing proportionally. In this paper the performance of several machine learning algorithms in identifying cyber-attacks (namely SYN-DOS attacks) on IoT systems is compared, both in terms of application performance and in training/application times. We use supervised machine learning algorithms included in the MLlib library of Apache Spark, a fast and general engine for big data processing. We show the implementation details and the performance of those algorithms on public datasets using a training set of up to 2 million instances. We adopt a Cloud environment, emphasizing the importance of scalability and elasticity of use. Results show that all the Spark algorithms used achieve very good identification accuracy (>99%). Overall, one of them, Random Forest, achieves an accuracy of 1 (i.e., 100%). We also report a very short training time (23.22 sec for Decision Tree with 2 million rows). The experiments also show a very low application time (0.13 sec for over 600,000 instances with Random Forest) using Apache Spark in the Cloud. Furthermore, the explicit model generated by Random Forest is very easy to implement using high- or low-level programming languages. In light of the results obtained, both in terms of computation times and identification performance, a hybrid approach for the detection of SYN-DOS cyber-attacks on IoT devices is proposed: the application of an explicit Random Forest model, implemented directly on the IoT device, along with a second-level analysis (training) performed in the Cloud.
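
A minimal PySpark MLlib sketch of the first-level Random Forest classifier described above; the file name and column layout are hypothetical, and the paper's exact features and hyperparameters are not reproduced.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("syn-dos-rf").getOrCreate()

# Hypothetical schema: numeric traffic features plus a 0/1 SYN-DOS label.
df = spark.read.csv("iot_traffic.csv", header=True, inferSchema=True)
features = [c for c in df.columns if c != "label"]
df = VectorAssembler(inputCols=features, outputCol="features").transform(df)

train, test = df.randomSplit([0.8, 0.2], seed=42)
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=100)
model = rf.fit(train)

acc = MulticlassClassificationEvaluator(
    labelCol="label", metricName="accuracy").evaluate(model.transform(test))
print(f"accuracy = {acc:.4f}")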

39 citations


Journal ArticleDOI
TL;DR: An optimized TF-IDF method under Spark is proposed to ensure that the text words can be restored, and the processing speed of LDA topic-model clustering is improved on Spark.
Abstract: Due to the slow processing speed of text topic clustering in a stand-alone architecture in the context of big data, this paper takes news text as the research object and proposes an LDA text topic clustering algorithm based on the Spark big data platform. Since the TF-IDF (term frequency-inverse document frequency) algorithm under Spark maps words irreversibly, the mapped word indexes cannot be traced back to the original words. In this paper, an optimized TF-IDF method under Spark is proposed to ensure that the text words can be restored. Firstly, text features are extracted by the proposed TF-IDF algorithm combined with CountVectorizer, and the features are then input to the LDA (Latent Dirichlet Allocation) topic model for training. Finally, the text topic clusters are obtained. Experimental results show that, for large data samples, the processing speed of LDA topic-model clustering is improved on Spark. At the same time, compared with the LDA topic model based on word-frequency input, the model proposed in this paper achieves lower perplexity.
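
The reversibility point can be seen in a small PySpark sketch: CountVectorizer keeps an explicit vocabulary, so LDA term indices map back to the original words (unlike hashing-based term frequencies); the toy documents are illustrative only.

from pyspark.sql import SparkSession
from pyspark.ml.feature import CountVectorizer, IDF
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.appName("lda-topics").getOrCreate()
# Hypothetical input: one row per news article, already tokenized.
docs = spark.createDataFrame(
    [(0, ["stock", "market", "rally"]), (1, ["team", "wins", "match"])],
    ["id", "tokens"])

# CountVectorizer keeps an explicit vocabulary, so term indices can be
# mapped back to words -- the reversibility plain hashing lacks.
cv_model = CountVectorizer(inputCol="tokens", outputCol="tf").fit(docs)
tf = cv_model.transform(docs)
tfidf = IDF(inputCol="tf", outputCol="features").fit(tf).transform(tf)

lda_model = LDA(k=2, maxIter=20, featuresCol="features").fit(tfidf)
vocab = cv_model.vocabulary
for row in lda_model.describeTopics(3).collect():
    print([vocab[i] for i in row.termIndices])  # topics as actual words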

38 citations


Journal ArticleDOI
TL;DR: This paper proposes and evaluates cloud services for high resolution video streams in order to perform line detection using Canny edge detection followed by Hough transform in Hadoop and Spark and demonstrates the effectiveness of parallel implementation of computer vision algorithms to achieve good scalability for real-world applications.
Abstract: Nowadays, video cameras are increasingly used for surveillance, monitoring, and activity recording. These cameras generate high resolution image and video data at large scale. Processing such large scale video streams to extract useful information with time constraints is challenging. Traditional methods do not offer scalability to process large scale data. In this paper, we propose and evaluate cloud services for high resolution video streams in order to perform line detection using Canny edge detection followed by Hough transform. These algorithms are often used as preprocessing steps for various high level tasks including object, anomaly, and activity recognition. We implement and evaluate both Canny edge detector and Hough transform algorithms in Hadoop and Spark. Our experimental evaluation using Spark shows an excellent scalability and performance compared to Hadoop and standalone implementations for both Canny edge detection and Hough transform. We obtained a speedup of 10.8× and 9.3× for Canny edge detection and Hough transform respectively using Spark. These results demonstrate the effectiveness of parallel implementation of computer vision algorithms to achieve good scalability for real-world applications.
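
A minimal PySpark sketch of the per-frame pipeline, assuming OpenCV is installed on the workers; the frame source and Hough parameters are illustrative, not the paper's configuration.

import cv2
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("line-detect").getOrCreate()
sc = spark.sparkContext

def detect_lines(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200)  # Canny edge detection
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=80,
                            minLineLength=30, maxLineGap=10)  # Hough transform
    return [] if lines is None else lines.reshape(-1, 4).tolist()

# Hypothetical loader: frames as numpy BGR arrays, decoded elsewhere.
frames = [np.zeros((480, 640, 3), dtype=np.uint8) for _ in range(8)]
results = sc.parallelize(frames).map(detect_lines).collect()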

37 citations


Journal ArticleDOI
TL;DR: By using four stages of successive refinements, CLUBS+ delivers high-quality clusters of data grouped around their centroids, working in a totally unsupervised fashion.

35 citations


Journal ArticleDOI
Lin Chen, Jiaying Pan, Changwen Liu, Gequn Shu, Haiqiao Wei
01 Feb 2020-Energy
TL;DR: In this paper, a double-spark ignition system was used to investigate the influence of rapid combustion on engine performance and knocking characteristics, and the results showed that under the synchronous double-spark ignition condition, output power and effective thermal efficiency are improved because of the shortened combustion duration.

Journal ArticleDOI
TL;DR: This paper proposes a parallel adaptive Canopy-K-means algorithm, which can be used in a cloud computing framework to determine the distance threshold parameter T2 adaptively based on a statistical method.
Abstract: Firstly, this paper introduces the types of clustering algorithms, and describes the classical K-means and Canopy algorithms in detail. Then, combining the MapReduce computing model and the Spark cloud computing framework, it introduces the parallel Canopy-K-means algorithm, in which the Canopy algorithm optimizes the initial values of K-means. However, the Canopy algorithm introduces a new distance threshold parameter T2 that must be set from human experience, which is difficult to do manually for large data. This paper therefore proposes a parallel adaptive Canopy-K-means algorithm that determines the distance threshold parameter T2 adaptively, based on a statistical method, within the cloud computing framework. Using the parallelism of the MapReduce computing model, the parallel Canopy-K-means algorithm is optimized by adaptive parameter estimation, which removes the dependence on manual, experience-based parameter selection in the Canopy stage. After introducing the relevant theory and derivation of the algorithm, a cloud computing experiment platform is built on the Spark framework, and contrast experiments are performed using the Stanford Large Network Dataset Collection (SNAP) dataset and a self-built Dimension Networks dataset. The experimental results show that the proposed method is effective.
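
A pure-Python sketch of the classic Canopy pass and one plausible statistical estimate of T2; the paper's actual adaptive estimator is not reproduced here.

import numpy as np

def canopy(points, t1, t2):
    # Classic Canopy pass with t1 > t2: points within t2 of a chosen center
    # leave the pool; points within t1 join the canopy loosely.
    assert t1 > t2
    pool = list(range(len(points)))
    canopies = []
    while pool:
        c = pool.pop(0)
        center = points[c]
        dists = np.linalg.norm(points[pool] - center, axis=1)
        members = [c] + [p for p, d in zip(pool, dists) if d < t1]
        pool = [p for p, d in zip(pool, dists) if d >= t2]
        canopies.append((center, members))
    return canopies  # canopy count and centers seed K and the K-means init

# Illustrative statistical choice of t2 from sampled pairwise distances,
# standing in for the paper's adaptive estimator.
pts = np.random.rand(100, 2)
t2 = np.percentile([np.linalg.norm(a - b) for a in pts[:20] for b in pts[:20]], 20)
print(len(canopy(pts, t1=2 * t2, t2=t2)))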

Journal ArticleDOI
TL;DR: The test results prove that ScienceEarth can efficiently store, retrieve, and process remote sensing data.
Abstract: Mass remote sensing data management and processing is currently one of the most important topics. In this study, we introduce ScienceEarth, a cluster-based data processing framework. The aim of ScienceEarth is to store, manage, and process large-scale remote sensing data in a cloud-based cluster-computing environment. The platform consists of three main parts: ScienceGeoData, ScienceGeoIndex, and ScienceGeoSpark. ScienceGeoData stores and manages remote sensing data. ScienceGeoIndex is an index and query system that combines a quad-tree spatial index with a Hilbert curve for heterogeneous tiled remote sensing data, which makes data retrieval in ScienceGeoData efficient. ScienceGeoSpark is an easy-to-use computing framework in which we use Apache Spark as the analytics engine for big remote sensing data processing. The test results prove that ScienceEarth can efficiently store, retrieve, and process remote sensing data, revealing its potential and capabilities for efficient big remote sensing data processing.

Journal ArticleDOI
TL;DR: Spark has better performance compared to Hadoop when data sets are small, achieving up to a two-times speedup in WordCount workloads and up to 14 times in TeraSort workloads when default parameter values are reconfigured.
Abstract: Big Data analytics for storing, processing, and analyzing large-scale datasets has become an essential tool for the industry. The advent of distributed computing frameworks such as Hadoop and Spark offers efficient solutions to analyze vast amounts of data. Due to its application programming interface (API) availability and its performance, Spark has become very popular, even more popular than the MapReduce framework. Both of these frameworks have more than 150 parameters, and the combination of these parameters has a massive impact on cluster performance. The default system parameters help system administrators deploy their applications without much effort, and they can measure their specific cluster performance with factory-set parameters. However, an open question remains: can new parameter selection improve cluster performance for large datasets? In this regard, this study investigates the most impactful parameters, covering resource utilization, input splits, and shuffle, to compare the performance of Hadoop and Spark, using a cluster implemented in our laboratory. We used a trial-and-error approach to tune these parameters, based on a large number of experiments. To evaluate the frameworks in a comparative analysis, we selected two workloads: WordCount and TeraSort. The performance metrics are based on three criteria: execution time, throughput, and speedup. Our experimental results revealed that the performance of both systems heavily depends on input data size and correct parameter selection. The analysis of the results shows that Spark has better performance compared to Hadoop when data sets are small, achieving up to a two-times speedup in WordCount workloads and up to 14 times in TeraSort workloads when default parameter values are reconfigured.
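
For reference, a minimal PySpark WordCount of the kind benchmarked here, with a couple of reconfigured parameters; the paths and values are illustrative, not the study's tuned settings.

import time
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("wordcount-bench")
         # Reconfigured from defaults, as in the study; values illustrative.
         .config("spark.executor.memory", "6g")
         .config("spark.default.parallelism", "32")
         .getOrCreate())
sc = spark.sparkContext

start = time.time()
counts = (sc.textFile("hdfs:///data/corpus.txt")   # hypothetical input path
            .flatMap(lambda line: line.split())
            .map(lambda w: (w, 1))
            .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("hdfs:///out/wordcount")     # hypothetical output path
print(f"execution time: {time.time() - start:.1f}s")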

Journal ArticleDOI
TL;DR: In this article, the authors implement a novel remote sensing data flow (RESFlow) for advancing machine learning to compute with massive amounts of remotely sensed imagery, where the core contribution is partitioning a massive amount of data into homogeneous distributions for fitting simple models.
Abstract: The sheer volumes of data generated from earth observation and remote sensing technologies continue to make a major impact, leaping key geospatial applications into the dual data- and compute-intensive era. As a consequence, this rapid advancement poses new computational and data processing challenges. We implement a novel remote sensing data flow (RESFlow) for advancing machine learning to compute with massive amounts of remotely sensed imagery. The core contribution is partitioning massive amounts of data into homogeneous distributions for fitting simple models. RESFlow takes advantage of Apache Spark and the availability of modern computing hardware to harness the acceleration of deep learning inference on expansive remote sensing imagery. The framework incorporates a strategy to optimize resource utilization across multiple executors assigned to a single worker. We showcase its deployment in both computationally and data-intensive workloads for pixel-level labeling tasks. The pipeline invokes deep learning inference at three stages: during deep feature extraction, deep metric mapping, and deep semantic segmentation. The tasks impose compute-intensive and GPU resource-sharing challenges, motivating a parallelized pipeline for all execution steps. To address the problem of hardware resource contention, our containerized workflow further incorporates a novel GPU checkout routine and a ticketing system across multiple workers. The workflow is demonstrated on NVIDIA DGX accelerated platforms and offers appreciable compute speed-ups for deep learning inference on pixel labeling workloads; it processes 21,028 TB of imagery data and delivers output maps at an area rate of 5.245 sq.km/s, amounting to 453,168 sq.km/day, reducing a 28-day workload to 21 h.

Journal ArticleDOI
TL;DR: This work proposes a novel algorithm to forecast big data time series, based on the well-established Pattern Sequence-based Forecasting algorithm, which uses the Apache Spark distributed computation framework and is a ready-to-use application with few parameters to adjust.

Journal ArticleDOI
TL;DR: The proposed scheme, named efficient apriori-based frequent itemset mining (EAFIM), presents two novel methods to further improve efficiency; reducing the size of the input dataset for higher iterations enables EAFIM to perform better.
Abstract: Frequent itemset mining is considered a popular tool to discover knowledge from transactional datasets. It also serves as the basis for association rule mining. Several algorithms have been proposed to find frequent patterns, of which the apriori algorithm is considered the earliest. Apriori has two significant bottlenecks associated with it: first, repeated scanning of the input dataset, and second, the requirement to generate all candidate itemsets before counting their support values. These bottlenecks reduce the effectiveness of apriori for large-scale datasets. Considerable efforts have been made to diminish these bottlenecks so that efficiency can be improved. Especially when the data size is larger, even distributed and parallel environments like MapReduce do not perform well due to the iterative nature of the algorithm, which incurs high disk overhead. Apache Spark, on the other hand, is gaining significant attention in the field of big data processing because of its in-memory processing capabilities. Apart from utilizing the parallel and distributed computing environment of Spark, the proposed scheme, named efficient apriori-based frequent itemset mining (EAFIM), presents two novel methods to improve efficiency further. Unlike apriori, it generates the candidates 'on-the-fly', i.e., candidate generation and counting of support values proceed simultaneously while the input dataset is being scanned. Also, instead of using the original input dataset in each iteration, it computes an updated input dataset by removing useless items and transactions. The reduction in size of the input dataset for higher iterations enables EAFIM to perform better. Extensive experiments were conducted to analyze the efficiency and scalability of EAFIM, which outperforms other existing methodologies.
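
A simplified single-machine sketch of EAFIM's two ideas, on-the-fly candidate counting and shrinking the dataset between iterations; the authors' Spark implementation is not reproduced.

from itertools import combinations
from collections import Counter

def eafim_like(transactions, min_support):
    k, frequent = 1, {}
    data = [set(t) for t in transactions]
    current = {frozenset([i]) for t in data for i in t}
    while current:
        # On-the-fly: one scan both enumerates and counts candidates.
        counts = Counter(c for t in data for c in current if c <= t)
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        keep = set().union(*level) if level else set()
        # Reduction: drop useless items, then transactions too short for k+1.
        data = [t & keep for t in data]
        data = [t for t in data if len(t) > k]
        k += 1
        current = {a | b for a in level for b in level
                   if len(a | b) == k and all(
                       frozenset(s) in level for s in combinations(a | b, k - 1))}
    return frequent

print(eafim_like([["a", "b"], ["a", "c"], ["a", "b", "c"]], min_support=2))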

Posted Content
TL;DR: A single trainable NER model is presented that obtains new state-of-the-art results on seven public biomedical benchmarks without using heavy contextual embeddings like BERT and can be extended to support other human languages with no code changes.
Abstract: Named entity recognition (NER) is a widely applicable natural language processing task and building block of question answering, topic modeling, information retrieval, etc. In the medical domain, NER plays a crucial role by extracting meaningful chunks from clinical notes and reports, which are then fed to downstream tasks like assertion status detection, entity resolution, relation extraction, and de-identification. Reimplementing a Bi-LSTM-CNN-Char deep learning architecture on top of Apache Spark, we present a single trainable NER model that obtains new state-of-the-art results on seven public biomedical benchmarks without using heavy contextual embeddings like BERT. This includes improving BC4CHEMD to 93.72% (4.1% gain), Species800 to 80.91% (4.6% gain), and JNLPBA to 81.29% (5.2% gain). In addition, this model is freely available within a production-grade code base as part of the open-source Spark NLP library; can scale up for training and inference in any Spark cluster; has GPU support and libraries for popular programming languages such as Python, R, Scala and Java; and can be extended to support other human languages with no code changes.

Journal ArticleDOI
TL;DR: This research proposes to use Apache Spark to enhance the performance of a scalable stochastic optimization model for a microgrid (MG) serving multiple buildings, and to ensure that a significant portion of the wind power output will be utilized.

Proceedings ArticleDOI
11 Jun 2020
TL;DR: This paper proposes pruning algorithms for a variety of queries and implements the system, Cheetah, on a Barefoot Tofino switch and Spark to partially offload query computation to the switch.
Abstract: Modern database systems are growing increasingly distributed and struggle to reduce query completion time with a large volume of data. In this paper, we leverage programmable switches in the network to partially offload query computation to the switch. While switches provide high performance, they have resource and programming constraints that make implementing diverse queries difficult. To fit in these constraints, we introduce the concept of data pruning -- filtering out entries that are guaranteed not to affect output. The database system then runs the same query but on the pruned data, which significantly reduces processing time. We propose pruning algorithms for a variety of queries. We implement our system, Cheetah, on a Barefoot Tofino switch and Spark. Our evaluation on multiple workloads shows 40 - 200% improvement in the query completion time compared to Spark.
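
The pruning idea can be illustrated for a TOP-N query in a few lines of Python, with each partition playing the switch's role; the actual Tofino data-plane program is far more constrained and is not reproduced here.

import heapq

def prune_topn(partitions, n):
    # Each partition forwards only entries that could still be in the global
    # top N (guaranteed losers are dropped); the database then runs the same
    # query over far less data.
    pruned = []
    for part in partitions:
        pruned.extend(heapq.nlargest(n, part))
    return heapq.nlargest(n, pruned)  # final query over the pruned data

parts = [[5, 1, 9, 3], [8, 2, 7], [4, 6, 10]]
print(prune_topn(parts, n=2))  # [10, 9]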

Journal ArticleDOI
TL;DR: A thorough review of various kinds of optimization techniques for the generality and performance of Spark is presented; the survey also introduces the Spark programming model and computing system and discusses their pros and cons.
Abstract: With the explosive increase of big data, it is necessary to apply large-scale data processing systems to analyze Big Data. Arguably, Spark is state of the art in large-scale data computing systems nowadays, due to its good properties including generality, fault tolerance, high performance of in-memory data processing, and scalability. Spark adopts a flexible Resilient Distributed Dataset (RDD) programming model with a set of provided transformation and action operators whose operating functions can be customized by users according to their applications. It is originally positioned as a fast and general data processing system. A large body of research efforts have been made to make it more efficient (faster) and more general by considering various circumstances since its introduction. In this survey, we aim to provide a thorough review of various kinds of optimization techniques for the generality and performance improvement of Spark. We introduce the Spark programming model and computing system, discuss the pros and cons of Spark, and investigate the solving techniques proposed in the literature. Moreover, we also introduce various data management and processing systems, machine learning algorithms, and applications supported by Spark. Finally, we discuss the open issues and challenges for Spark.
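
A minimal PySpark sketch of the RDD model the survey describes: lazy transformations build a lineage graph, and an action triggers execution.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-model").getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize(range(1, 11))

# Transformations are lazy: they only extend the RDD lineage graph.
evens_squared = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

# Actions trigger execution of the whole lineage.
print(evens_squared.reduce(lambda a, b: a + b))  # 4+16+36+64+100 = 220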

Journal ArticleDOI
TL;DR: The results prove that the proposed method outperforms other methods, typically achieving 98–99% F-scores, and offering much greater accuracy than alternative techniques to detect both the period in which anomalies occurred and their type.
Abstract: Late detection and manual resolution of performance anomalies in Cloud Computing and Big Data systems may lead to performance violations and financial penalties. Motivated by this issue, we propose an artificial neural network based methodology for anomaly detection tailored to the Apache Spark in-memory processing platform. Apache Spark is widely adopted by industry because of its speed and generality; however, there is still a shortage of comprehensive performance anomaly detection methods applicable to this platform. We propose an artificial neural network driven methodology to quickly sift through Spark log data and operating system monitoring metrics to accurately detect and classify anomalous behaviors based on the characteristics of the Spark resilient distributed datasets. The proposed method is evaluated against three popular machine learning algorithms, decision trees, nearest neighbor, and support vector machine, as well as against four variants that consider different monitoring datasets. The results prove that our proposed method outperforms the other methods, typically achieving 98–99% F-scores, and offering much greater accuracy than alternative techniques in detecting both the period in which anomalies occurred and their type.
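
A minimal PySpark sketch using MLlib's multilayer perceptron as the neural classifier; the metric schema, layer sizes, and class set are hypothetical, not the paper's architecture.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import MultilayerPerceptronClassifier

spark = SparkSession.builder.appName("spark-anomaly").getOrCreate()

# Hypothetical schema: OS/Spark monitoring metrics plus an anomaly-class
# label (e.g., 0 = normal, 1 = CPU hog, 2 = memory leak, 3 = network).
df = spark.read.parquet("monitoring_metrics.parquet")
features = [c for c in df.columns if c != "label"]
df = VectorAssembler(inputCols=features, outputCol="features").transform(df)

mlp = MultilayerPerceptronClassifier(
    layers=[len(features), 32, 16, 4],  # input, two hidden layers, 4 classes
    labelCol="label", featuresCol="features", maxIter=200, seed=7)
model = mlp.fit(df)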

Journal ArticleDOI
TL;DR: This work introduces a novel method, called f-HMD, which aims at scalable hybrid model discovery in a cloud computing environment and returns hybrid process models to bridge the gap between formal and informal models.
Abstract: Process descriptions are used to create products and deliver services. To achieve better processes and services, the first step is to learn a process model. Process discovery is such a technique, one which can automatically extract process models from event logs. Although various discovery techniques have been proposed, they focus either on constructing formal models, which are very powerful but complex, or on creating informal models, which are intuitive but lack semantics. In this work, we introduce a novel method that returns hybrid process models to bridge this gap. Moreover, to cope with today's big event logs, we propose an efficient method, called f-HMD, which aims at scalable hybrid model discovery in a cloud computing environment. We present the detailed implementation of our approach on the Spark framework, and our experimental results demonstrate that the proposed method is efficient and scalable.

Journal ArticleDOI
TL;DR: Different DL models for IDS on Apache Spark have been implemented; an enhanced model is used to improve attack detection accuracy, and a computation delay comparison between Apache Spark and a regular implementation is presented.
Abstract: Internet evolution produced a connected world with a massive amount of data. This connectivity advantage came at the price of more complex and advanced attacks. An Intrusion Detection System (IDS) is an essential component for security in modern networks. The IDS methodology is either signature-based detection or anomaly behavior detection. Recently, researchers have adopted Deep Learning (DL) because it has better performance than traditional machine learning algorithms. The use of DL to produce a model for the IDS may take a long time because of computational complexity and the large number of hyperparameters. Different DL models for IDS on Apache Spark have been implemented in this article. This article uses the famous Network Security Lab - Knowledge Discovery and Data Mining (NSL-KDD) dataset and presents a computation delay comparison between Apache Spark and a regular implementation. Moreover, an enhanced model is used to improve attack detection accuracy.

Journal ArticleDOI
TL;DR: A new framework that can be used to pollute a clean, homogeneous and large data set from an arbitrary domain with duplicates, errors and inhomogeneities is described.
Abstract: Because of the increasing volume of autonomously collected data objects, duplicate detection is an important challenge in today's data management. To evaluate the efficiency of duplicate detection algorithms with respect to big data, large test data sets are required. Existing test data generation tools, however, are either unable to produce large test data sets or are domain-dependent, which limits their usefulness to a few cases. In this paper, we describe a new framework that can be used to pollute a clean, homogeneous, and large data set from an arbitrary domain with duplicates, errors, and inhomogeneities. As a proof of concept, we implemented a prototype built upon the cluster computing framework Apache Spark and evaluated its performance in several experiments.
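
A pure-Python sketch of the pollution idea, emitting near-duplicates with injected typos; the field choices and error model are illustrative, and a Spark version would distribute the same function with flatMap.

import random
import string

def pollute(row, dup_prob=0.2, err_prob=0.1):
    # Emit the clean row plus, with some probability, a near-duplicate
    # carrying injected character-level errors.
    out = [row]
    if random.random() < dup_prob:
        dup = list(row)
        for i, field in enumerate(dup):
            if field and random.random() < err_prob:
                pos = random.randrange(len(field))  # corrupt one character
                dup[i] = field[:pos] + random.choice(string.ascii_lowercase) + field[pos + 1:]
        out.append(tuple(dup))
    return out

# On Spark, the same function distributes over the clean data set:
#   sc.parallelize(clean_rows).flatMap(pollute)
print(pollute(("alice", "smith", "berlin")))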

Proceedings ArticleDOI
18 Jan 2020
TL;DR: MLlib, the Spark library for machine learning algorithms, is utilized for distributed computing, and the obtained results show that Spark produces high accuracy while parallelizing the process of load forecasting with highly competitive training and test times.
Abstract: Load forecasting in the smart grid is the process of predicting the amount of electrical power needed to meet short, medium, and long term demand. Accurate load forecasting helps electrical utilities manage their energy production, operations, control, and management. Most state-of-the-art forecasting methodologies utilize classical machine learning algorithms to predict the electrical load. There is a need for big data platforms and parallel distributed computing to be utilized to their full potential in the available solutions. In this paper, Apache Spark and Apache Hadoop are utilized as big data platforms for distributed computing in order to predict the load using available big data, and MLlib, the Spark library for machine learning algorithms, is utilized for distributed computing. Using MLlib allows testing classic regression algorithms such as linear regression, generalized linear regression, decision trees, random forests, and gradient-boosted trees, in addition to survival regression and isotonic regression. The obtained results show that Spark produces high accuracy while parallelizing the process of load forecasting with highly competitive training and test times. Actual big data are used in the load forecasting process.
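
A minimal PySpark sketch comparing two of the MLlib regressors mentioned above on a hypothetical feature table; the column names and split are illustrative.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression, RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.appName("load-forecast").getOrCreate()

# Hypothetical schema: lagged loads, calendar and weather features -> "load".
df = spark.read.parquet("smart_meter_features.parquet")
cols = [c for c in df.columns if c != "load"]
df = VectorAssembler(inputCols=cols, outputCol="features").transform(df)
train, test = df.randomSplit([0.8, 0.2], seed=1)

evaluator = RegressionEvaluator(labelCol="load", metricName="rmse")
for reg in (LinearRegression(labelCol="load"),
            RandomForestRegressor(labelCol="load", numTrees=50)):
    model = reg.fit(train)
    print(type(reg).__name__, evaluator.evaluate(model.transform(test)))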

Journal ArticleDOI
TL;DR: A fuzzy fault tree analysis approach based on the similarity aggregation method (SAM-FFTA) is proposed that combines SAM with fuzzy set theory and can comprehensively handle diverse forms of opinions from different experts to obtain the probabilities of bottom events in a fault tree.
Abstract: Fault tree analysis (FTA) is an important method for analyzing the failure causes of engineering systems and evaluating their safety and reliability. In practical applications, the probabilities of bottom events in FTA are usually estimated according to the opinions of experts or engineers, because it is difficult to obtain sufficient probability data for the bottom events in a fault tree. However, in many cases there are many experts with different opinions, or different forms of opinions. How to reasonably aggregate expert opinions is a challenge for the engineering application of the fault tree method. In this study, a fuzzy fault tree analysis approach based on the similarity aggregation method (SAM-FFTA) is proposed. This method combines SAM with fuzzy set theory and can comprehensively handle diverse forms of opinions from different experts to obtain the probabilities of bottom events in the fault tree. Finally, to verify the applicability and flexibility of the proposed method, a natural gas spherical storage tank with a volume of 10,000 m³ was analyzed, and the importance of each bottom event was determined. The results show that flame, lightning spark, electrostatic spark, impact spark, mechanical breakdown, and deformation/breakage have the most significant influence on the explosion of the natural gas spherical storage tank.
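
A compact Python sketch of a Hsu-Chen-style similarity aggregation over triangular fuzzy numbers, the mechanism behind SAM; the paper's membership functions, expert weights, and defuzzification are not reproduced.

def sam_aggregate(opinions, weights=None, beta=0.5):
    # opinions: triangular fuzzy numbers (a, b, c) elicited from experts;
    # beta trades off expert weight against inter-expert agreement.
    n = len(opinions)
    weights = weights or [1.0 / n] * n

    def sim(p, q):  # similarity of two triangular fuzzy numbers
        return 1.0 - sum(abs(x - y) for x, y in zip(p, q)) / 3.0

    # Average agreement of each expert with all the others.
    aa = [sum(sim(opinions[i], opinions[j]) for j in range(n) if j != i) / (n - 1)
          for i in range(n)]
    ra = [a / sum(aa) for a in aa]                                 # relative agreement
    cc = [beta * w + (1 - beta) * r for w, r in zip(weights, ra)]  # consensus coeff.
    return tuple(sum(c * op[k] for c, op in zip(cc, opinions)) for k in range(3))

# Three experts' fuzzy estimates of a bottom event's probability grade:
print(sam_aggregate([(0.1, 0.2, 0.3), (0.2, 0.3, 0.4), (0.15, 0.25, 0.35)]))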

Journal ArticleDOI
TL;DR: The presented GA parallelization architecture outperforms the state-of-the-art reference architectures according to the computational experiments where the testing instances of traveling salesman problems are employed.

Journal ArticleDOI
TL;DR: This research presents a novel and scalable approach called "Smart Cassandra Spark Integration (SCSI)" for solving the challenge of integrating NoSQL data stores to manage distributed systems.
Abstract: For over a decade now we have been witnessing the success of massive parallel computation frameworks, such as MapReduce, Hadoop, Dryad, or Spark. Compared to the classic distributed algorithms or P...

Journal ArticleDOI
Wen Xiao, Juan Hu
TL;DR: A distributed algorithm for mining frequent itemsets over massive streaming data, named SWEclat, is proposed and implemented on Apache Spark; it uses Spark RDDs to store the streaming data and the dataset in vertical data format, dividing these RDDs into partitions for distributed processing.
Abstract: Finding frequent itemsets in continuous streaming data is an important data mining task which is widely used in network monitoring, Internet of Things data analysis, and so on. In the era of big data, it is necessary to develop distributed frequent itemset mining algorithms to meet the needs of massive streaming data processing. Apache Spark is a unified analytics engine for massive data processing which has been successfully applied in many data mining fields. In this paper, we propose a distributed algorithm for mining frequent itemsets over massive streaming data, named SWEclat. The algorithm uses a sliding window to process the streaming data and a vertical data structure to store the dataset within the sliding window. The algorithm is implemented on Apache Spark and uses Spark RDDs to store the streaming data and the dataset in vertical data format, dividing these RDDs into partitions for distributed processing. Experimental results show that the SWEclat algorithm achieves good acceleration, parallel scalability, and load balancing.
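
A pure-Python sketch of the vertical (Eclat) representation SWEclat distributes: each item maps to a tidset, and support comes from tidset intersections; the sliding-window and Spark partitioning logic are omitted.

from collections import defaultdict

def eclat(transactions, min_support, prefix=frozenset(), tidsets=None, out=None):
    # Vertical layout: item -> set of transaction ids (tidset). The support
    # of an itemset is the size of the intersection of its items' tidsets.
    out = {} if out is None else out
    if tidsets is None:  # build item -> tidset from the horizontal input
        tidsets = defaultdict(set)
        for tid, t in enumerate(transactions):
            for item in t:
                tidsets[item].add(tid)
    items = sorted(tidsets)
    for i, item in enumerate(items):
        tids = tidsets[item]
        if len(tids) >= min_support:
            itemset = prefix | {item}
            out[itemset] = len(tids)
            # Conditional vertical database for extensions of `itemset`.
            cond = {j: tidsets[j] & tids for j in items[i + 1:]}
            eclat(None, min_support, itemset, cond, out)
    return out

print(eclat([["a", "b"], ["a", "c"], ["a", "b", "c"]], min_support=2))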