
Showing papers on "Spark (mathematics)" published in 2021


Journal ArticleDOI
TL;DR: A semisupervised prediction model is proposed, which exploits an improved unsupervised clustering algorithm to establish the fuzzy partition function and then utilizes a neural network model to build the information prediction function.
Abstract: The continuous development of industry big data technology requires better computing methods to discover the value of data. Information forecasting, as an important part of data mining technology, has achieved excellent applications in some industries. However, the deviation and redundancy existing in the data collected by sensors make it difficult for some methods to accurately predict future information. This article proposes a semisupervised prediction model, which exploits an improved unsupervised clustering algorithm to establish the fuzzy partition function and then utilizes a neural network model to build the information prediction function. The main purpose of this article is to effectively address the time-series analysis of massive industry data. In the experimental part, we built a data platform on Spark and used some marine environmental factor datasets and UCI public datasets as analysis objects. Meanwhile, we compared the results of the proposed method with those of other traditional methods and analyzed its running performance on the Spark platform. The results show that the proposed method achieved a satisfactory prediction effect.

104 citations
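As a rough illustration of this cluster-then-predict idea on Spark, the sketch below assigns hard cluster labels with MLlib's KMeans and feeds them, together with the raw features, into a regressor. The column names and the choice of LinearRegression are hypothetical stand-ins; the paper's fuzzy partition function and neural network predictor are not part of MLlib.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("cluster-then-predict").getOrCreate()

# Toy rows standing in for sensor readings: (temp, salinity, ph, target).
rows = [(20.1, 34.9, 8.1, 1.2), (20.3, 35.0, 8.0, 1.3), (25.7, 33.1, 7.9, 2.4),
        (25.9, 33.0, 7.8, 2.5), (18.2, 36.2, 8.2, 0.9), (18.0, 36.1, 8.3, 0.8)]
df = spark.createDataFrame(rows, ["temp", "salinity", "ph", "target"])

raw_cols = ["temp", "salinity", "ph"]
assemble_raw = VectorAssembler(inputCols=raw_cols, outputCol="raw_features")
# Hard K-Means clustering as a simplified stand-in for the fuzzy partition step.
kmeans = KMeans(k=3, featuresCol="raw_features", predictionCol="cluster", seed=1)
# The cluster assignment becomes an extra input of the downstream predictor.
assemble_aug = VectorAssembler(inputCols=raw_cols + ["cluster"], outputCol="features")
regressor = LinearRegression(featuresCol="features", labelCol="target")

model = Pipeline(stages=[assemble_raw, kmeans, assemble_aug, regressor]).fit(df)
model.transform(df).select("cluster", "target", "prediction").show()
```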


Journal ArticleDOI
TL;DR: This paper successfully tackles the problem of processing a vast amount of security related data for the task of network intrusion detection by employing Apache Spark, as a big data processing tool, and proposes a hybrid scheme that combines the advantages of deep network and machine learning methods.
Abstract: This paper successfully tackles the problem of processing a vast amount of security-related data for the task of network intrusion detection. It employs Apache Spark, as a big data processing tool, for processing a large size of network traffic data. Also, we propose a hybrid scheme that combines the advantages of deep network and machine learning methods. Initially, a stacked autoencoder network is used for latent feature extraction, followed by several classification-based intrusion detection methods, such as support vector machine, random forest, decision trees, and naive Bayes, which are used for fast and efficient detection of intrusion in massive network traffic data. The real-time UNB ISCX 2012 dataset is used to validate our proposed method, and the performance is evaluated in terms of accuracy, f-measure, sensitivity, precision and time.

67 citations
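A minimal sketch of the classification stage only, assuming a labelled DataFrame `traffic_df` of numeric flow features; the stacked-autoencoder feature extraction described in the paper is omitted, and the feature column names are placeholders:

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

feature_cols = ["duration", "src_bytes", "dst_bytes", "pkt_count"]  # placeholder names
data = VectorAssembler(inputCols=feature_cols, outputCol="features").transform(traffic_df)
train, test = data.randomSplit([0.7, 0.3], seed=13)

rf = RandomForestClassifier(featuresCol="features", labelCol="label", numTrees=100)
predictions = rf.fit(train).transform(test)

evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="accuracy")
print("accuracy:", evaluator.evaluate(predictions))
```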


Journal ArticleDOI
TL;DR: In this paper, a distributed convolutional neural network (DCNN) based approach for big remote sensing image classification is proposed; it is the first study of its kind to propose a distributed deep learning-based approach for the classification of big remote sensing images.

55 citations


Journal ArticleDOI
TL;DR: The results showed that well-trained machine learning models can complement more complex physical models while also helping to optimize engine performance, emissions, and life.

53 citations


Journal ArticleDOI
TL;DR: This paper developed a deep learning algorithm that provides early warning signals (EWS) in systems it was not explicitly trained on, by exploiting information about normal forms and scaling behavior of dynamics near tipping points that are common to many dynamical systems.
Abstract: Many natural systems exhibit tipping points where slowly changing environmental conditions spark a sudden shift to a new and sometimes very different state. As the tipping point is approached, the dynamics of complex and varied systems simplify down to a limited number of possible "normal forms" that determine qualitative aspects of the new state that lies beyond the tipping point, such as whether it will oscillate or be stable. In several of those forms, indicators like increasing lag-1 autocorrelation and variance provide generic early warning signals (EWS) of the tipping point by detecting how dynamics slow down near the transition. But they do not predict the nature of the new state. Here we develop a deep learning algorithm that provides EWS in systems it was not explicitly trained on, by exploiting information about normal forms and scaling behavior of dynamics near tipping points that are common to many dynamical systems. The algorithm provides EWS in 268 empirical and model time series from ecology, thermoacoustics, climatology, and epidemiology with much greater sensitivity and specificity than generic EWS. It can also predict the normal form that characterizes the oncoming tipping point, thus providing qualitative information on certain aspects of the new state. Such approaches can help humans better prepare for, or avoid, undesirable state transitions. The algorithm also illustrates how a universe of possible models can be mined to recognize naturally occurring tipping points.

53 citations
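The generic early warning signals the abstract contrasts against, rolling variance and lag-1 autocorrelation, can be computed directly; the short NumPy sketch below illustrates those baseline indicators only, not the paper's deep learning detector:

```python
import numpy as np

def rolling_ews(x, window=100):
    """Return rolling variance and lag-1 autocorrelation of a 1-D series."""
    var, ac1 = [], []
    for i in range(window, len(x) + 1):
        w = x[i - window:i]
        var.append(np.var(w))
        ac1.append(np.corrcoef(w[:-1], w[1:])[0, 1])
    return np.array(var), np.array(ac1)

# A rising trend in both signals as the series evolves is the classic sign of
# critical slowing down near a tipping point.
series = np.cumsum(np.random.randn(1000)) * 0.01 + np.random.randn(1000)
variance, autocorr = rolling_ews(series, window=200)
```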


Journal ArticleDOI
TL;DR: This paper proposes the first compound framework for dealing with multi-class big data problems, addressing at the same time the existence of multiple classes and high volumes of data, and proposes an efficient implementation of the discussed algorithm on Apache Spark.
Abstract: Despite more than two decades of progress, learning from imbalanced data is still considered one of the contemporary challenges in machine learning. This has been further complicated by the advent of the big data era, where popular algorithms dedicated to alleviating the class skew impact are no longer feasible due to the volume of datasets. Additionally, most existing algorithms focus on binary imbalanced problems, where majority and minority classes are well-defined. Multi-class imbalanced data poses further challenges as the relationship between classes is much more complex and simple decomposition into a number of binary problems leads to a significant loss of information. In this paper, we propose the first compound framework for dealing with multi-class big data problems, addressing at the same time the existence of multiple classes and high volumes of data. We propose to analyze the instance-level difficulties in each class, leading to understanding what causes learning difficulties. We embed this information in popular resampling algorithms which allows for informative balancing of multiple classes. We propose an efficient implementation of the discussed algorithm on Apache Spark, including a novel version of SMOTE that overcomes spatial limitations in distributed environments of its predecessor. Extensive experimental study shows that using instance-level information significantly improves learning from multi-class imbalanced big data. Our framework can be downloaded from https://github.com/fsleeman/minority-type-imbalanced .

50 citations
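The framework itself (including the distributed SMOTE variant) is in the linked repository; as a much simpler illustration of balancing multiple classes on Spark, the sketch below performs naive per-class random oversampling of a DataFrame `df` with a `label` column, which is not the paper's method:

```python
from functools import reduce
from pyspark.sql import functions as F

counts = {r["label"]: r["count"] for r in df.groupBy("label").count().collect()}
majority = max(counts.values())

parts = [df]
for label, cnt in counts.items():
    if cnt < majority:
        # Draw (with replacement) roughly enough extra rows to match the majority class.
        extra_fraction = (majority - cnt) / cnt
        parts.append(df.filter(F.col("label") == label)
                       .sample(withReplacement=True, fraction=extra_fraction, seed=42))

balanced = reduce(lambda a, b: a.union(b), parts)
balanced.groupBy("label").count().show()
```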


Journal ArticleDOI
TL;DR: This study aims to help retail companies create personalized deals and promotions for their customers, even during the COVID-19 pandemic, through a big data framework that allows them to handle massive sales volumes with more efficient models.
Abstract: Retail companies recognize the need to analyze and predict their sales and customer behavior against their products and product categories. Our study aims to help retail companies create personalized deals and promotions for their customers, even during the COVID-19 pandemic, through a big data framework that allows them to handle massive sales volumes with more efficient models. In this paper, we used Black Friday sales data taken from a dataset on the Kaggle website, which contains nearly 550,000 observations analyzed with 10 qualitative and quantitative features. The class label is purchases and sales (in U.S. dollars). Because the predicted label is continuous, regression models are suited to this case. Using the Apache Spark big data framework, which provides the MLlib machine learning library, we trained two machine learning models: linear regression and random forest. These machine learning algorithms were used to predict future pricing and sales. We first implemented a linear regression model and a random forest model without using the Spark framework and achieved accuracies of 68% and 74%, respectively. Then, we trained these models on the Spark machine learning big data framework, where we achieved an accuracy of 72% for the linear regression model and 81% for the random forest model.

45 citations
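A condensed sketch of the Spark MLlib side of such a study, assuming the Kaggle Black Friday data has already been loaded, encoded numerically, and named `sales_df` with the usual `Purchase` target column; the feature list is illustrative:

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression, RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator

feature_cols = ["Gender", "Age", "Occupation", "City_Category", "Product_Category_1"]  # already encoded
data = VectorAssembler(inputCols=feature_cols, outputCol="features") \
    .transform(sales_df).select("features", "Purchase")
train, test = data.randomSplit([0.8, 0.2], seed=3)

evaluator = RegressionEvaluator(labelCol="Purchase", predictionCol="prediction", metricName="r2")
for estimator in (LinearRegression(labelCol="Purchase"),
                  RandomForestRegressor(labelCol="Purchase", numTrees=100)):
    model = estimator.fit(train)
    print(type(estimator).__name__, "R2:", evaluator.evaluate(model.transform(test)))
```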


Journal ArticleDOI
TL;DR: The K-nearest neighbor (KNN) algorithm is implemented on datasets of different sizes within both the Hadoop and Spark frameworks, and the results show that the runtime of the KNN algorithm implemented on Spark is 4 to 4.5 times faster than on Hadoop.
Abstract: One of the most challenging issues in the big data research area is the inability to process a large volume of information in a reasonable time. Hadoop and Spark are two frameworks for distributed data processing. Hadoop is a very popular and general platform for big data processing. Because of its in-memory programming model, Spark as an open-source framework is suitable for processing iterative algorithms. In this paper, the Hadoop and Spark frameworks, the big data processing platforms, are evaluated and compared in terms of runtime, memory and network usage, and central processor efficiency. Hence, the K-nearest neighbor (KNN) algorithm is implemented on datasets of different sizes within both the Hadoop and Spark frameworks. The results show that the runtime of the KNN algorithm implemented on Spark is 4 to 4.5 times faster than on Hadoop. Evaluations show that Hadoop uses more resources, including the central processor and the network. It is concluded that CPU utilization in Spark is more efficient than in Hadoop. On the other hand, memory usage in Hadoop is less than in Spark.

43 citations
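A minimal brute-force sketch of how a single KNN query can be expressed on Spark DataFrames (distances computed in parallel across partitions, then the k smallest taken); the feature columns and query point are made up for illustration:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("knn-sketch").getOrCreate()

df = spark.createDataFrame(
    [(1, 0.2, 1.1), (2, 0.4, 0.9), (3, 5.0, 4.8), (4, 5.2, 5.1), (5, 0.1, 1.0)],
    ["id", "f1", "f2"])

query = {"f1": 0.3, "f2": 1.0}   # the point whose neighbours we want
k = 3

# Squared Euclidean distance, built column by column and evaluated in parallel.
sq_dist = sum(F.pow(F.col(c) - v, 2) for c, v in query.items())
neighbours = df.withColumn("sq_dist", sq_dist).orderBy("sq_dist").limit(k)
neighbours.show()
```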


Journal ArticleDOI
TL;DR: An efficient rolling bearing fault diagnosis method based on Spark and an improved random forest (IRF) algorithm is proposed; by eliminating the decision trees with low classification accuracy and those prone to repeated voting in the original RF, an improved RF with faster diagnosis speed and higher classification accuracy is constructed.
Abstract: The random forest (RF) algorithm is a typical representative of ensemble learning, which is widely used in rolling bearing fault diagnosis. In order to solve the problems of slower diagnosis speed and repeated voting of traditional RF algorithm in rolling bearing fault diagnosis under the big data environment, an efficient rolling bearing fault diagnosis method based on Spark and improved random forest (IRF) algorithm is proposed. By eliminating the decision trees with low classification accuracy and those prone to repeated voting in the original RF, an improved RF with faster diagnosis speed and higher classification accuracy is constructed. For the massive rolling bearing vibration data, in order to improve the training speed and diagnosis speed of the rolling bearing fault diagnosis model, the IRF algorithm is parallelized on the Spark platform. First, an original RF model is obtained by training multiple decision trees in parallel. Second, the decision trees with low classification accuracy in the original RF model are filtered. Third, all path information of the reserved decision trees is obtained in parallel. Fourth, a decision tree similarity matrix is constructed in parallel to eliminate the decision trees which are prone to repeated voting. Finally, an IRF model which can diagnose rolling bearing faults quickly and effectively is obtained. A series of experiments are carried out to evaluate the effectiveness of the proposed rolling bearing fault diagnosis method based on Spark and IRF algorithm. The results show that the proposed method can not only achieve good fault diagnosis accuracy, but also have fast model training speed and fault diagnosis speed for large-scale rolling bearing datasets.

39 citations
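A simplified sketch of the tree-filtering idea only (not the paper's full IRF, which also removes trees prone to repeated voting via a similarity matrix), assuming a labelled DataFrame `data` and a recent PySpark version in which the fitted forest exposes its member trees through `model.trees`:

```python
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

train, val = data.randomSplit([0.8, 0.2], seed=7)
forest = RandomForestClassifier(featuresCol="features", labelCol="label",
                                numTrees=50).fit(train)

evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="accuracy")
# Score each member decision tree separately and keep only the stronger ones.
scored = [(tree, evaluator.evaluate(tree.transform(val))) for tree in forest.trees]
kept = [tree for tree, acc in scored if acc >= 0.80]   # illustrative threshold
print(f"kept {len(kept)} of {len(forest.trees)} trees")
# MLlib does not rebuild a forest from a subset of trees, so the final vote
# over `kept` would have to be aggregated manually.
```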


Journal ArticleDOI
TL;DR: A four-stage MapReduce framework, based solely on the well-known Spark platform, is presented for use in high-utility sequential pattern mining; it is shown to deliver more efficient and faster mining performance when dealing with large data sets.
Abstract: The concepts of sequential pattern mining have become a growing topic in data mining, finding a home most recently in the Internet of Things (IoT), where large volumes of data are presented by the second for analysis and knowledge extraction. One key topic within the realm of sequential pattern mining is high-utility sequential pattern mining (HUSPM). HUSPM takes into account the fusion of utility and sequence factors to assist in the determination of sequential patterns of high utility from databases and data sources. That being said, almost all current existing literature focuses on using only a single machine to increase mining performance. In this work, we present a four-stage MapReduce framework that is based solely on the well-known Spark platform for use in HUSPM. This framework is shown to deliver more efficient and faster mining performance when dealing with large data sets. It consists of four phases, namely initialization, mining, updating, and generation, to handle big data sets based on the MapReduce framework running on the Spark platform. Experiments indicated that the designed model is capable of handling very big data sets, while state-of-the-art algorithms can only achieve good performance on small data sets.

37 citations
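Utility-aware mining is not part of MLlib, but plain sequential pattern mining is; the sketch below shows MLlib's PrefixSpan on toy sequences, as a utility-agnostic baseline rather than the HUSPM framework described above:

```python
from pyspark.sql import SparkSession, Row
from pyspark.ml.fpm import PrefixSpan

spark = SparkSession.builder.appName("prefixspan-sketch").getOrCreate()

# Each row is a sequence of itemsets.
sequences = spark.createDataFrame([
    Row(sequence=[[1, 2], [3]]),
    Row(sequence=[[1], [3, 2], [1, 2]]),
    Row(sequence=[[1, 2], [5]]),
    Row(sequence=[[6]]),
])

prefix_span = PrefixSpan(minSupport=0.5, maxPatternLength=5, sequenceCol="sequence")
prefix_span.findFrequentSequentialPatterns(sequences).show()
```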


Journal ArticleDOI
01 May 2021
TL;DR: Spark NLP is used by 54% of healthcare organizations as the world’s most widely used NLP library in the enterprise, and has experienced 9x growth since January 2020.
Abstract: Spark NLP is a Natural Language Processing (NLP) library built on top of Apache Spark ML. It provides simple, performant & accurate NLP annotations for machine learning pipelines that can scale easily in a distributed environment. Spark NLP comes with 1100+ pretrained pipelines and models in more than 192 languages. It supports nearly all the NLP tasks and modules that can be used seamlessly in a cluster. Downloaded more than 2.7 million times and experiencing 9x growth since January 2020, Spark NLP is used by 54% of healthcare organizations as the world’s most widely used NLP library in the enterprise.
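A minimal usage sketch of the pretrained-pipeline API described above; the pipeline name is one of the commonly documented ones, model files are downloaded on first use, and the exact output keys depend on the chosen pipeline:

```python
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

spark = sparknlp.start()  # starts a Spark session with Spark NLP on the classpath

pipeline = PretrainedPipeline("explain_document_dl", lang="en")
result = pipeline.annotate("Spark NLP ships with over 1100 pretrained pipelines and models.")

# Keys vary by pipeline (tokens, lemmas, POS tags, entities, ...).
print(result.keys())
```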

Journal ArticleDOI
TL;DR: This paper addresses the real-time prediction of application layer DDoS attacks with different machine learning models, applying the scikit-learn library and the big data framework Spark ML for the detection of Denial of Service (DoS) attacks, and optimizes the performance of the models by minimizing the prediction time.
Abstract: Currently, the Distributed Denial of Service (DDoS) attack has become rampant, and shows up in various shapes and patterns, therefore it is not easy to detect and solve with previous solutions. Classification algorithms have been used in many studies aiming to detect and solve the DDoS attack. DDoS attacks are performed easily by using the weaknesses of networks and by generating requests for services for software. Real-time detection of DDoS attacks is difficult to achieve, but this solution holds significant value as these attacks can cause big issues. This paper addresses the prediction of application layer DDoS attacks in real time with different machine learning models. We applied two machine learning approaches, Random Forest (RF) and Multi-Layer Perceptron (MLP), through the Scikit ML library and the big data framework Spark ML library for the detection of Denial of Service (DoS) attacks. In addition to the detection of DoS attacks, we optimized the performance of the models by minimizing the prediction time, compared with other existing approaches, using the big data framework (Spark ML). We achieved a mean accuracy of 99.5% with the models both with and without the big data approach. However, in training and testing time, the big data approach outperforms the non-big data approach because Spark performs its in-memory computations in a distributed manner. The minimum average training and testing times, 14.08 and 0.04 minutes respectively, were achieved using the big data tool (Apache Spark), while the maximum training and testing times, 34.11 and 0.46 minutes respectively, were obtained with the non-big data approach. We can detect an attack in real time within a few milliseconds.

Journal ArticleDOI
TL;DR: A survey of hardware accelerators and hardware-aware algorithmic optimizations for 3D CNNs is presented and it is believed that this survey will spark a great deal of research towards the design of ultra-efficient 3DCNN accelerators of tomorrow.

Journal ArticleDOI
TL;DR: This paper addresses the problem of efficiently storing and querying spatio-temporal RDF data in parallel, by proposing the DiStRDF system, which is comprised of a Storage and a Processing Layer and uses Spark, a well-known distributed in-memory processing framework, as the underlying processing engine.
Abstract: The ever-increasing size of data emanating from mobile devices and sensors, dictates the use of distributed systems for storing and querying these data. Typically, such data sources provide some spatio-temporal information, alongside other useful data. The RDF data model can be used to interlink and exchange data originating from heterogeneous sources in a uniform manner. For example, consider the case where vessels report their spatio-temporal position, on a regular basis, by using various surveillance systems. In this scenario, a user might be interested to know which vessels were moving in a specific area for a given temporal range. In this paper, we address the problem of efficiently storing and querying spatio-temporal RDF data in parallel. We specifically study the case of SPARQL queries with spatio-temporal constraints, by proposing the DiStRDF system, which is comprised of a Storage and a Processing Layer. The DiStRDF Storage Layer is responsible for efficiently storing large amount of historical spatio-temporal RDF data of moving objects. On top of it, we devise our DiStRDF Processing Layer, which parses a SPARQL query and produces corresponding logical and physical execution plans. We use Spark, a well-known distributed in-memory processing framework, as the underlying processing engine. Our experimental evaluation, on real data from both aviation and maritime domains, demonstrates the efficiency of our DiStRDF system, when using various spatio-temporal range constraints.
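The SPARQL-with-constraints machinery is specific to DiStRDF, but the core spatio-temporal range predicate it evaluates is easy to picture on a plain Spark DataFrame; the sketch below is that simplified relational analogue, with made-up column names, path, and bounds:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("spatio-temporal-range").getOrCreate()

# Assumed schema of position reports: vessel_id, ts (timestamp), lon, lat.
positions = spark.read.parquet("/data/positions")   # illustrative path

in_area_and_time = positions.filter(
    F.col("lon").between(23.0, 24.5) &
    F.col("lat").between(37.5, 38.2) &
    F.col("ts").between("2021-01-01 00:00:00", "2021-01-02 00:00:00"))

in_area_and_time.select("vessel_id").distinct().show()
```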

Journal ArticleDOI
TL;DR: A novel approach, namely the Spark-based Apriori algorithm with reduced shuffle overhead (SARSO), is proposed, which utilizes the benefits of Spark’s parallel and distributed computing environment and its in-memory processing capabilities.
Abstract: Mining frequent itemsets is considered a core activity for finding association rules from transactional datasets. Among the different well-known approaches to find frequent itemsets, the Apriori algorithm is the earliest proposed. Many attempts have been made to adapt the Apriori algorithm for large-scale datasets. But the bottlenecks associated with Apriori, such as repeated scans of the input dataset and generation of all the candidate itemsets prior to counting their support values, reduce the effectiveness of Apriori for large-size datasets. When the data size is large, even distributed and parallel implementations of Apriori using the MapReduce framework do not perform well. This is due to the iterative nature of the algorithm, which incurs high disk overhead. In each iteration, the input dataset residing on disk is scanned, causing high disk I/O. Apache Spark implementations of Apriori show better performance due to in-memory processing capabilities. Spark makes iterative scanning of datasets faster by keeping them in a memory abstraction called a resilient distributed dataset (RDD). An RDD keeps datasets in the form of key-value pairs spread across the cluster nodes. RDD operations require these key-value pairs to be redistributed among cluster nodes in the course of processing. This redistribution, or shuffle operation, incurs communication and synchronization overhead. In this manuscript, we propose a novel approach, namely the Spark-based Apriori algorithm with reduced shuffle overhead (SARSO). It utilizes the benefits of Spark’s parallel and distributed computing environment and its in-memory processing capabilities. It further improves efficiency by reducing the shuffle overhead caused by RDD operations at each iteration. In other words, it restricts the movement of key-value pairs across the cluster nodes by using a partitioning method and hence reduces the communication and synchronization overhead incurred by the Spark shuffle operation. Extensive experiments have been conducted to measure the performance of SARSO on benchmark datasets and compare it with an existing algorithm. Experimental results show that SARSO has better performance in terms of running time and scalability.
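The general idea of cutting shuffle cost by fixing the partitioning of key-value pairs up front can be illustrated with stock RDD operations; this is only a generic sketch of that mechanism (pre-partitioning plus a matching reduceByKey avoids a second redistribution), not the SARSO algorithm itself, and the input path and format are illustrative:

```python
from pyspark import SparkContext

sc = SparkContext(appName="partitioned-counting-sketch")

# One transaction per line, items separated by spaces.
pairs = (sc.textFile("hdfs:///data/transactions.txt")
           .flatMap(lambda line: [(item, 1) for item in line.split()]))

# Hash-partition the pairs once and cache them; a later reduceByKey that uses
# the same number of partitions (and the default hash partitioner) can then
# combine values locally without moving records across the cluster again.
partitioned = pairs.partitionBy(64).cache()
support_counts = partitioned.reduceByKey(lambda a, b: a + b, 64)

print(support_counts.take(5))
```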

Journal ArticleDOI
TL;DR: This article proposes a fast Content-Based Image Retrieval system using Spark (CBIR-S) targeting large-scale images, and shows the effectiveness of the approach in terms of processing time.

Journal ArticleDOI
15 Jul 2021-Energy
TL;DR: In this paper, a customized liner with four side spark plugs was used to trigger controllable knock, through various spark strategies (e.g., spark number, timing, and location).

Journal ArticleDOI
TL;DR: An efficient classification and reduction technique for big data based on the parallel generalized Hebbian algorithm (GHA), one of the commonly used principal component analysis (PCA) neural network (NN) learning algorithms, is presented.
Abstract: Advancements in information technology are contributing to the excessive rate of big data generation recently. Big data refers to datasets that are huge in volume and consume much time and space to process and transmit using the available resources. Big data also covers data with unstructured and structured formats. Many agencies are currently subscribing to research on big data analytics owing to the failure of existing data processing techniques to handle the rate at which big data is generated. This paper presents an efficient classification and reduction technique for big data based on the parallel generalized Hebbian algorithm (GHA), which is one of the commonly used principal component analysis (PCA) neural network (NN) learning algorithms. The new method proposed in this study was compared to existing methods to demonstrate its capabilities in reducing the dimensionality of big data. The proposed method is implemented using the Spark Radoop platform.
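MLlib does not ship the generalized Hebbian algorithm, but the dimensionality-reduction step it approximates can be sketched with MLlib's exact PCA, assuming a DataFrame `df` of numeric columns listed in `feature_cols`:

```python
from pyspark.ml.feature import PCA, VectorAssembler

assembled = VectorAssembler(inputCols=feature_cols, outputCol="features").transform(df)

pca = PCA(k=10, inputCol="features", outputCol="reduced")   # keep 10 components
model = pca.fit(assembled)

print(model.explainedVariance)          # variance captured by each component
reduced = model.transform(assembled).select("reduced")
```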

Journal ArticleDOI
TL;DR: The Mondrian multidimensional anonymization method was developed and improved to satisfy the l-diversity privacy model, and it is presented in a distributed fashion within the Apache Spark framework to resolve the speed problem in large-scale data anonymization that exists in previous Hadoop-based algorithms.

DOI
13 Sep 2021
TL;DR: In this paper, the authors studied the effect of big data processing on NLP tasks based on a deep learning approach and compared the performance of BERT with the pipelines from Spark NLP.
Abstract: The rise of big data analytics on top of NLP is increasing the computational burden for text processing at scale. The problems faced in NLP involve very high dimensional text, so they require substantial computational resources. MapReduce allows parallelization of large computations and can improve the efficiency of text processing. This research aims to study the effect of big data processing on NLP tasks based on a deep learning approach. We classify a big text of news topics by fine-tuning BERT with pretrained models. Five pretrained models with different numbers of parameters were used in this study. To measure the efficiency of this method, we compared the performance of BERT with the pipelines from Spark NLP. The result shows that BERT without Spark NLP gives higher accuracy compared to BERT with Spark NLP. The average accuracy and training time of all models using BERT are 0.9187 and 35 minutes, while using BERT with the Spark NLP pipeline they are 0.8444 and 9 minutes. A bigger model takes more computational resources and needs a longer time to complete the task. However, the accuracy of BERT with Spark NLP decreased by only an average of 5.7%, while the training time was reduced significantly, by 62.9%, compared to BERT without Spark NLP.

Journal ArticleDOI
TL;DR: In this article, a kernel-based fuzzy clustering approach is proposed to deal with non-linearly separable problems by applying kernel radial basis functions (RBF), which map the input data space non-linearly into a high-dimensional feature space.

Journal ArticleDOI
TL;DR: This paper proposes an architecture for data storage and analytics in a big data lake of electricity usage using Spark, integrating the data following the data lake principle with Hive and HBase and their respective search engines.
Abstract: Smart meters generate a large number of electricity usage records day by day. The traditional architecture might not properly handle such increasingly dynamic data, which needs flexibility. For effective storage and analytics, an efficient architecture is needed to accommodate much greater data volumes and varieties. In this paper, we propose an architecture for data storage and analytics in a big data lake of electricity usage using Spark. Apache Sqoop was used to migrate historical data from an existing system to Apache Hive for processing. Apache Kafka was used as the input source for Spark to stream data to Apache HBase, to ensure the integrity of the streaming data. In order to integrate the data following the data lake principle, Apache Impala and Apache Phoenix are used as search engines for Hive and HBase, respectively. This work also analyzes electricity usage and power failures with Apache Spark. All of the visualizations of this project are presented in Apache Superset. Moreover, a usage prediction comparison is presented using the Holt-Winters algorithm.
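The Kafka-to-Spark leg of such a pipeline can be sketched with Structured Streaming; the broker address, topic, and sink paths below are placeholders, the HBase sink used in the paper is replaced by a Parquet sink for brevity, and the kafka source requires the spark-sql-kafka package on the classpath:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("meter-stream").getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
       .option("subscribe", "meter-readings")              # placeholder topic
       .load())

readings = raw.selectExpr("CAST(value AS STRING) AS reading",
                          "timestamp AS ingest_ts")

query = (readings.writeStream
         .format("parquet")
         .option("path", "/data/lake/meter")
         .option("checkpointLocation", "/data/lake/_checkpoints/meter")
         .outputMode("append")
         .start())

query.awaitTermination()
```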

Journal ArticleDOI
TL;DR: In this paper, a fault diagnosis method of rolling bearing using Spark-based parallel ant colony optimization (ACO)-K-means clustering algorithm is proposed, which can not only achieve good fault diagnosis accuracy but also provide high model training efficiency and fault diagnosis efficiency in a big data environment.
Abstract: K-Means clustering algorithm is a typical unsupervised learning method, and it has been widely used in the field of fault diagnosis. However, the traditional serial K-Means clustering algorithm is difficult to efficiently and accurately perform clustering analysis on the massive running-state monitoring data of rolling bearing. Therefore, a novel fault diagnosis method of rolling bearing using Spark-based parallel ant colony optimization (ACO)-K-Means clustering algorithm is proposed. Firstly, a Spark-based three-layer wavelet packet decomposition approach is developed to efficiently preprocess the running-state monitoring data to obtain eigenvectors, which are stored in Hadoop Distributed File System (HDFS) and served as the input of ACO-K-Means clustering algorithm. Secondly, ACO-K-Means clustering algorithm suitable for rolling bearing fault diagnosis is proposed to improve the diagnosis accuracy. ACO algorithm is adopted to obtain the global optimal initial clustering centers of K-Means from all eigenvectors, and the K-Means clustering algorithm based on weighted Euclidean distance is used to perform clustering analysis on all eigenvectors to obtain a rolling bearing fault diagnosis model. Thirdly, the efficient parallelization of ACO-K-Means clustering algorithm is implemented on a Spark platform, which can make full use of the computing resources of a cluster to efficiently process large-scale rolling bearing datasets in parallel. Extensive experiments are conducted to verify the effectiveness of the proposed fault diagnosis method. Experimental results show that the proposed method can not only achieve good fault diagnosis accuracy but also provide high model training efficiency and fault diagnosis efficiency in a big data environment.

Journal ArticleDOI
01 Aug 2021-Fuel
TL;DR: In this paper, a specialized liner with four side spark plugs mounted on the cylinder head is used to produce various in-cylinder flame propagation patterns, and various spark strategies (e.g., spark timing, spark number, spark location) are applied to generate different auto-ignition sites and knock characteristics.

Journal ArticleDOI
TL;DR: This paper proposes a new approach based on Adaboost, which can efficiently and accurately predict the performance of a given application with a given Spark configuration, and uses the classic projective sampling technique to minimize the overhead of the modeling.

Book ChapterDOI
10 Jan 2021
TL;DR: This paper implemented a Bi-LSTM-CNN-Char deep learning architecture on top of Apache Spark and achieved state-of-the-art results on seven public biomedical benchmarks without using heavy contextual embeddings.
Abstract: Named entity recognition (NER) is a widely applicable natural language processing task and building block of question answering, topic modeling, information retrieval, etc. In the medical domain, NER plays a crucial role by extracting meaningful chunks from clinical notes and reports, which are then fed to downstream tasks like assertion status detection, entity resolution, relation extraction, and de-identification. Reimplementing a Bi-LSTM-CNN-Char deep learning architecture on top of Apache Spark, we present a single trainable NER model that obtains new state-of-the-art results on seven public biomedical benchmarks without using heavy contextual embeddings like BERT. This includes improving BC4CHEMD to 93.72% (4.1% gain), Species800 to 80.91% (4.6% gain), and JNLPBA to 81.29% (5.2% gain). In addition, this model is freely available within a production-grade code base as part of the open-source Spark NLP library; can scale up for training and inference in any Spark cluster; has GPU support and libraries for popular programming languages such as Python, R, Scala and Java; and can be extended to support other human languages with no code changes.

Journal ArticleDOI
TL;DR: This work is based on two key enablers: containers, to isolate Spark's parallel executors and allow for the dynamic and fast allocation of resources, and control-theory to govern resource allocation at runtime and obtain required precision and speed.
Abstract: Many big-data applications are batch applications that exploit dedicated frameworks to perform massively parallel computations across clusters of machines. The time needed to process the entirety of the inputs represents the application's response time, which can be subject to deadlines. Spark, probably the most famous incarnation of these frameworks today, allocates resources to applications statically at the beginning of the execution and deviations are not managed: to meet the applications’ deadlines, resources must be allocated carefully. This paper proposes an extension to Spark, called dynaSpark, that is able to allocate and redistribute resources to applications dynamically to meet deadlines and cope with the execution of unanticipated applications. This work is based on two key enablers: containers, to isolate Spark's parallel executors and allow for the dynamic and fast allocation of resources, and control-theory to govern resource allocation at runtime and obtain required precision and speed. Our evaluation shows that dynaSpark can (i) allocate resources efficiently to execute single applications with respect to set deadlines and (ii) reduce deadline violations (w.r.t. Spark) when executing multiple concurrent applications.
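dynaSpark itself extends Spark with containers and control theory; for comparison, the closest stock mechanism is Spark's built-in dynamic allocation, which the sketch below enables through ordinary configuration (the executor bounds are illustrative):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("elastic-batch")
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.dynamicAllocation.minExecutors", "2")
         .config("spark.dynamicAllocation.maxExecutors", "50")
         # Shuffle tracking (Spark 3.0+) avoids needing an external shuffle service.
         .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
         .getOrCreate())
```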

Journal ArticleDOI
TL;DR: The results showed that Spark performed better than Hadoop in the parallel implementation of the recommendation algorithm on heterogeneous Spark clusters, and that the HSATS adaptive task scheduling strategy reduced job completion time and made the utilization of cluster node resources more reasonable.

Journal ArticleDOI
TL;DR: An RDD-based implementation of a subtree data anonymization technique for Apache Spark is proposed to address the issues associated with MapReduce-based counterparts; results show high performance compared to existing state-of-the-art privacy-preserving approaches.
Abstract: Data anonymization strategies such as subtree generalization have been hailed as techniques that provide a more efficient generalization strategy compared to full-tree generalization counterparts. Many subtree-based generalization strategies (e.g., top-down, bottom-up, and hybrid) have been implemented on the MapReduce platform to take advantage of scalability and parallelism. However, MapReduce inherently lacks support for iteration-intensive algorithms such as subtree generalization. This paper proposes a Resilient Distributed Dataset (RDD)-based implementation of a subtree-based data anonymization technique for Apache Spark to address the issues associated with MapReduce-based counterparts. We describe our RDD-based approach, which offers effective partition management, improved memory usage through caching of frequently referenced intermediate values, and enhanced iteration support. Our experimental results show high performance compared to existing state-of-the-art privacy-preserving approaches, while ensuring the data utility and privacy levels required of any competitive data anonymization technique.
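A toy sketch of the ingredients this paragraph mentions, caching and iteration, applied to a generic generalization loop that progressively masks a quasi-identifier until every group reaches size k; the record layout, column index, and masking rule are made up, and this is not the paper's subtree algorithm:

```python
from pyspark import SparkContext

sc = SparkContext(appName="anon-sketch")

# Assumed CSV layout: name,age,zipcode,diagnosis (illustrative path and schema).
records = sc.textFile("hdfs:///data/records.csv").map(lambda l: l.split(",")).cache()

K = 10     # minimum group size we want per generalized zipcode
ZIP = 2    # index of the quasi-identifier column in each record

def generalize(zipcode, level):
    """Mask the last `level` digits of a zipcode, e.g. 10115 -> 101**."""
    return zipcode if level == 0 else zipcode[:-level] + "*" * level

chosen_level = 5   # fall back to full masking if no smaller level suffices
for level in range(0, 6):
    # Each pass re-reads the cached RDD instead of rescanning the input files.
    group_sizes = records.map(lambda r, lv=level: generalize(r[ZIP], lv)).countByValue()
    if min(group_sizes.values()) >= K:
        chosen_level = level
        break

anonymized = records.map(lambda r, lv=chosen_level:
                         r[:ZIP] + [generalize(r[ZIP], lv)] + r[ZIP + 1:])
```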