
Showing papers on "Spark (mathematics)" published in 2019


Journal ArticleDOI
TL;DR: GeoSpark is presented, which extends the core engine of Apache Spark and SparkSQL to support spatial data types, indexes, and geometrical operations at scale and achieves up to two orders of magnitude faster run time performance than existing Hadoop-based systems.
Abstract: The paper presents the details of designing and developing GeoSpark, which extends the core engine of Apache Spark and SparkSQL to support spatial data types, indexes, and geometrical operations at scale. The paper also gives a detailed analysis of the technical challenges and opportunities of extending Apache Spark to support state-of-the-art spatial data partitioning techniques: uniform grid, R-tree, Quad-Tree, and KDB-Tree. The paper also shows how building local spatial indexes, e.g., R-Tree or Quad-Tree, on each Spark data partition can speed up the local computation and hence decrease the overall runtime of the spatial analytics program. Furthermore, the paper introduces a comprehensive experiment analysis that surveys and experimentally evaluates the performance of running de-facto spatial operations like spatial range, spatial K-Nearest Neighbors (KNN), and spatial join queries in the Apache Spark ecosystem. Extensive experiments on real spatial datasets show that GeoSpark achieves up to two orders of magnitude faster run time performance than existing Hadoop-based systems and up to an order of magnitude faster performance than Spark-based systems.
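
A minimal PySpark sketch of the kind of spatial range query the paper describes. The registration step, file name, and schema are assumptions rather than the authors' code; the ST_ function names follow the OGC convention GeoSpark adopts:

```python
# Hypothetical sketch of a GeoSpark-style spatial range query in SparkSQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("geospark-range-query").getOrCreate()
# GeoSpark's ST_ functions become available once its SQL registrator has been
# called on the session (the exact registration API depends on the version).

points = spark.read.csv("taxi_points.csv").toDF("id", "wkt")  # placeholder file
points.createOrReplaceTempView("points")

window_query = spark.sql("""
    SELECT id
    FROM points
    WHERE ST_Contains(
        ST_PolygonFromEnvelope(-74.1, 40.6, -73.7, 40.9),  -- query window
        ST_GeomFromWKT(wkt))
""")
window_query.show()
```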

124 citations


Journal ArticleDOI
22 Apr 2019-Symmetry
TL;DR: This paper proposes a scalable and hybrid IDS, based on Spark ML and the convolutional-LSTM (Conv-LSTM) network, which can identify network misuses accurately in 97.29% of cases and outperforms state-of-the-art approaches in 10-fold cross-validation tests.
Abstract: With the rapid advancement of ubiquitous information and communication technologies, a large number of trustworthy online systems and services have been deployed. However, cybersecurity threats are still mounting. An intrusion detection (ID) system can play a significant role in detecting such security threats; thus, developing an intelligent and accurate ID system is a non-trivial research problem. Existing ID systems typically used in traditional network intrusion detection often fail to detect many known and new security threats, largely because those approaches are based on classical machine learning methods that put less focus on accurate feature selection and classification. Consequently, many known signatures in the attack traffic remain unidentified and become latent. Furthermore, since a massive network infrastructure can produce large-scale data, these approaches often fail to handle the data flexibly and hence are not scalable. To address these issues and improve accuracy and scalability, we propose a scalable hybrid IDS based on Spark ML and the convolutional-LSTM (Conv-LSTM) network. It is a two-stage ID system: the first stage employs an anomaly detection module based on Spark ML, and the second stage acts as a misuse detection module based on the Conv-LSTM network, such that both global and local latent threat signatures can be addressed. Evaluations of several baseline models on the ISCX-UNB dataset show that our hybrid IDS can identify network misuses accurately in 97.29% of cases and outperforms state-of-the-art approaches in 10-fold cross-validation tests.
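
The paper does not fix which Spark ML algorithm drives the first-stage anomaly detector, so the sketch below uses KMeans distance scoring as an assumed stand-in; train_df and test_df are placeholder DataFrames of assembled flow features, and the threshold is illustrative:

```python
# Stage-1 sketch only: flag flows that lie far from every learned traffic
# cluster as anomalous. Stage 2 (the Conv-LSTM) is outside Spark ML entirely.
import numpy as np
from pyspark.ml.clustering import KMeans
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

model = KMeans(k=8, featuresCol="features").fit(train_df)
centers = [np.array(c) for c in model.clusterCenters()]

@F.udf(DoubleType())
def dist_to_nearest_center(v):
    return float(min(np.linalg.norm(v.toArray() - c) for c in centers))

scored = model.transform(test_df).withColumn(
    "anomaly_score", dist_to_nearest_center("features"))
alerts = scored.where(F.col("anomaly_score") > 3.0)  # illustrative cutoff
```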

95 citations


Proceedings ArticleDOI
20 Nov 2019
TL;DR: This paper presents BigDL (a distributed deep learning framework for Apache Spark), which allows deep learning applications to run on the Apache Hadoop/Spark cluster so as to directly process the production data, and as a part of the end-to-end data analysis pipeline for deployment and management.
Abstract: This paper presents BigDL (a distributed deep learning framework for Apache Spark), which has been used by a variety of users in industry for building deep learning applications on production big data platforms. It allows deep learning applications to run on the Apache Hadoop/Spark cluster so as to directly process the production data, and as a part of the end-to-end data analysis pipeline for deployment and management. Unlike existing deep learning frameworks, BigDL implements distributed, data parallel training directly on top of the functional compute model (with copy-on-write and coarse-grained operations) of Spark. We also share real-world experience and "war stories" of users that have adopted BigDL to address their challenges (i.e., how to easily build end-to-end data analysis and deep learning pipelines for their production data).
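
BigDL's own API is not shown here; the sketch below only illustrates, in plain PySpark, the data-parallel pattern the paper describes: parameters broadcast each iteration and per-partition gradients aggregated with Spark's coarse-grained functional operations:

```python
# Conceptual sketch (not BigDL's API): one synchronous data-parallel gradient
# step for a squared loss, built from Spark's coarse-grained operations.
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="data-parallel-sgd-sketch")
data = sc.parallelize(
    [(np.random.rand(10), np.random.rand()) for _ in range(1000)]).cache()
n = data.count()
w = np.zeros(10)  # model parameters, re-broadcast every iteration

for _ in range(20):
    wb = sc.broadcast(w)
    grad = data.map(lambda xy: (xy[0].dot(wb.value) - xy[1]) * xy[0]) \
               .treeAggregate(np.zeros(10),
                              lambda acc, g: acc + g,   # within a partition
                              lambda a, b: a + b)       # across partitions
    w -= 0.1 * grad / n
```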

77 citations


Journal ArticleDOI
TL;DR: A model for analyzing transportation data with Hadoop and Spark is designed to handle real-time transportation data and can be used in generic vehicular network scenarios.
Abstract: Big data analytics are widely used in many areas, such as the efficient design and planning of smart transportation, smart control systems, smart cities, smart communities, and more. However, analyzing big data for smart control systems poses many challenges and issues for conventional engineering techniques. These challenges include processing big data in real time, fast processing, and efficient decision-making and management. In this article, we design a model for analyzing transportation data with Hadoop along with Spark to handle real-time transportation data. The system is divided into four layers: data collection and acquisition, network, data processing, and application. Each layer is designed to process and manage data in a well-organized format. The data is tested through Hadoop and Spark in the data processing layer and disseminated to smart community citizens using the proposed event and decision mechanism based on named data networking. The proposed system is tested on transportation datasets from various authentic sources. The results show processing of data and real-time dissemination to citizens in the least possible time. The Hadoop ecosystem, along with Spark, generates highly accurate results. Further, the significance of the proposed architecture is that it can be used in generic vehicular network scenarios.
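
The article does not publish code; as a rough illustration of the data processing layer, here is a Structured Streaming sketch that computes a windowed average speed per road segment (source, schema, and window sizes are all assumptions):

```python
# Hypothetical sketch: per-segment average speed over a sliding event window.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("transport-processing-layer").getOrCreate()

reports = (spark.readStream.format("socket")          # placeholder source
           .option("host", "localhost").option("port", 9999).load())
parsed = reports.select(
    F.split("value", ",").getItem(0).alias("segment"),
    F.split("value", ",").getItem(1).cast("double").alias("speed"),
    F.current_timestamp().alias("ts"))

avg_speed = (parsed
             .withWatermark("ts", "1 minute")
             .groupBy(F.window("ts", "5 minutes", "1 minute"), "segment")
             .agg(F.avg("speed").alias("avg_speed")))

avg_speed.writeStream.outputMode("update").format("console").start()
```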

72 citations


Journal ArticleDOI
TL;DR: Three different classification algorithms, namely, support vector machine (SVM), decision tree, and random forest, are selected to create nine models that help in predicting breast cancer, and experimental results showed that the scaled SVM classifier in the Spark environment outperforms the other classifiers.
Abstract: Recent advances in information technology have induced an explosive growth of data, creating a new era of big data. Unfortunately, traditional machine-learning algorithms cannot cope with the new characteristics of big data. In this paper, we address the problem of breast cancer prediction in the big data context. We considered two varieties of data, namely, gene expression (GE) and DNA methylation (DM). The objective of this paper is to scale up the machine-learning algorithms that are used for classification by applying each dataset separately and jointly. For this purpose, we chose Apache Spark as a platform. In this paper, we selected three different classification algorithms, namely, support vector machine (SVM), decision tree, and random forest, to create nine models that help in predicting breast cancer. We conducted a comprehensive comparative study using three scenarios with the GE, DM, and GE and DM combined, in order to show which of the three types of data would produce the best result in terms of accuracy and error rate. Moreover, we performed an experimental comparison between two platforms (Spark and Weka) in order to show their behavior when dealing with large sets of data. The experimental results showed that the scaled SVM classifier in the Spark environment outperforms the other classifiers, as it achieved the highest accuracy and the lowest error rate with the GE dataset.
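
A sketch of the Spark ML core of the nine-model comparison, assuming an active SparkSession and a prepared DataFrame `data` of assembled features and labels; LinearSVC stands in for the scaled SVM:

```python
# Train and score the three classifier families, reporting accuracy and error.
from pyspark.ml.classification import (LinearSVC, DecisionTreeClassifier,
                                       RandomForestClassifier)
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

train, test = data.randomSplit([0.8, 0.2], seed=42)   # data: features/label
evaluator = MulticlassClassificationEvaluator(labelCol="label",
                                              metricName="accuracy")

for clf in [LinearSVC(labelCol="label"),
            DecisionTreeClassifier(labelCol="label"),
            RandomForestClassifier(labelCol="label", numTrees=100)]:
    acc = evaluator.evaluate(clf.fit(train).transform(test))
    print(type(clf).__name__, "accuracy:", acc, "error:", 1.0 - acc)
```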

65 citations


Journal ArticleDOI
TL;DR: Batch processing (MapReduce-based systems), stream processing (Spark-based systems), and SQL-style processing geo-distributed frameworks, models, and algorithms are classified and studied, together with their overhead issues.
Abstract: Hadoop and Spark are widely used distributed processing frameworks for large-scale data processing in an efficient and fault-tolerant manner on private or public clouds. These big-data processing systems are extensively used by many industries, e.g., Google, Facebook, and Amazon, for solving a large class of problems, e.g., search, clustering, log analysis, different types of join operations, matrix multiplication, pattern matching, and social network analysis. However, all these popular systems have a major drawback in terms of locally distributed computations, which prevents them from implementing geographically distributed data processing. The increasing amount of geographically distributed massive data is pushing industry and academia to rethink the current big-data processing systems. Novel frameworks, which go beyond the state-of-the-art architectures and technologies involved in current systems, are expected to process geographically distributed data at their locations without moving entire raw datasets to a single location. In this paper, we investigate and discuss the challenges and requirements in designing geographically distributed data processing frameworks and protocols. We classify and study batch processing (MapReduce-based systems), stream processing (Spark-based systems), and SQL-style processing geo-distributed frameworks, models, and algorithms, together with their overhead issues.

57 citations


Journal ArticleDOI
TL;DR: A new architecture for real-time health status prediction and analytics system using big data technologies and measures the performance of Spark DT against traditional machine learning tools including Weka to show the effectiveness of the proposed architecture.
Abstract: A number of technologies enabled by the Internet of Things (IoT) have been used for the prevention of various chronic diseases; continuous, real-time tracking systems are a particularly important one. Wearable medical devices with sensors, health clouds and mobile applications continuously generate a huge amount of data, often called streaming big data. Due to the high speed of data generation, it is difficult to collect, process and analyze such massive data in real time using traditional methods, which are limited and time-consuming, in order to perform real-time actions in emergencies and extract hidden value. Therefore, there is a significant need for real-time big data stream processing to ensure an effective and scalable solution. To overcome this issue, this work proposes a new architecture for a real-time health status prediction and analytics system using big data technologies. The system focuses on applying a distributed machine learning model to streaming health data events ingested into Spark Streaming through Kafka topics. Firstly, we transform the standard decision tree (DT) (C4.5) algorithm into a parallel, distributed, scalable and fast DT using Spark instead of Hadoop MapReduce, which is limited for real-time computing. Secondly, this model is applied to streaming data coming from distributed sources covering various diseases in order to predict health status. Based on several input attributes, the system predicts health status, sends an alert message to care providers and stores the details in a distributed database, enabling health data analytics and stream reporting. We measure the performance of the Spark DT against traditional machine learning tools, including Weka. Finally, performance evaluation parameters such as throughput and execution time are calculated to show the effectiveness of the proposed architecture. The experimental results show that the proposed system is able to effectively process and predict, in real time, massive amounts of IoT-enabled medical data from distributed sources covering various diseases.
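
The ingestion path the abstract describes, rendered (as an assumption) with Structured Streaming rather than the paper's DStream API; the topic, brokers, schema, and the featurize() helper are hypothetical placeholders:

```python
# Plumbing sketch: Kafka -> Spark stream -> pre-trained decision tree -> alerts.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("health-stream-sketch").getOrCreate()

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "health-events")
          .load()
          .select(F.split(F.col("value").cast("string"), ",").alias("fields")))

# dt_model: a pyspark.ml DecisionTreeClassificationModel trained offline, per
# the paper's first step; featurize() (hypothetical) assembles its input vector.
scored = dt_model.transform(featurize(events))
alerts = scored.where("prediction = 1.0")
alerts.writeStream.format("console").start().awaitTermination()
```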

54 citations


Journal ArticleDOI
TL;DR: A distance-based anomaly detection strategy that considers objects described by embedding features learned via a stacked auto-encoder is proposed, together with a repair strategy that repairs data detected as anomalous by exploiting non-anomalous data measured by sensors in nearby spatial locations.

49 citations


Proceedings ArticleDOI
03 Apr 2019
TL;DR: This paper proposes a real-time heart disease prediction system based on Apache Spark, which stands as a strong large-scale distributed computing platform that can successfully apply machine learning to streaming data events through in-memory computations.
Abstract: Over the last few decades, heart disease has been the most common cause of death globally, so early detection of heart disease and continuous monitoring can reduce the mortality rate. The exponential growth of data from different sources, such as the wearable sensor devices used in IoT health monitoring and streaming systems, generates an enormous amount of data on a continuous basis. The combination of streaming big data analytics and machine learning is a breakthrough technology that can have a significant impact on the healthcare field, especially for early detection of heart disease; it can be more powerful and less expensive. To this end, this paper proposes a real-time heart disease prediction system based on Apache Spark, which stands as a strong large-scale distributed computing platform that can successfully apply machine learning to streaming data events through in-memory computations. The system consists of two main sub-parts, namely streaming processing, and data storage and visualization. The first uses Spark MLlib with Spark Streaming and applies a classification model to data events to predict heart disease. The second uses Apache Cassandra for storing the large volume of generated data.
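
A sketch of the second sub-part only: persisting scored events to Cassandra through foreachBatch and the Spark-Cassandra connector (keyspace and table names are assumptions, the connector must be on the classpath, and `predictions` stands for the streaming DataFrame produced by the MLlib model):

```python
# Write each micro-batch of predictions to Cassandra for later visualization.
def save_to_cassandra(batch_df, batch_id):
    (batch_df.write.format("org.apache.spark.sql.cassandra")
        .mode("append")
        .options(keyspace="health", table="heart_predictions")
        .save())

predictions.writeStream.foreachBatch(save_to_cassandra).start()
```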

48 citations


Journal ArticleDOI
TL;DR: A distributed courses recommender system for the e-learning platform that aims to discover relationships between students' activities using the association rules method, in order to help the student choose the most appropriate learning materials.
Abstract: The present work is part of the ESTenLigne project, the result of several years of experience developing e-learning at Sidi Mohamed Ben Abdellah University through the implementation of an open, online and adaptive learning environment. However, this platform faces many challenges, such as the increasing amount of data, the diversity of pedagogical resources and a large number of learners, which makes it harder to find what learners are really looking for. Furthermore, most of the students on this platform are new graduates who have just entered higher education and need a system to help them take the relevant courses, taking into account the requirements and needs of each learner. In this article, we develop a distributed courses recommender system for the e-learning platform. It aims to discover relationships between students' activities using the association rules method in order to help the student choose the most appropriate learning materials. We also focus on the analysis of past historical data of course enrollments, i.e., log data. The article particularly discusses the frequent itemset concept for determining the interesting rules in the transaction database. We then use the extracted rules to find the catalog of courses most suitable to the learner's behaviors and preferences. Next, we deploy our recommender system using big data technologies and techniques. In particular, we implement the parallel FP-growth algorithm provided by the Spark framework and the Hadoop ecosystem. The experimental results show the effectiveness and scalability of the proposed system. Finally, we evaluate the performance of the Spark MLlib library compared to traditional machine learning tools, including Weka and R.
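
The rule-mining core maps directly onto Spark ML's parallel FP-growth; the toy enrollment data and thresholds below are illustrative, not the platform's, and an active SparkSession `spark` is assumed:

```python
from pyspark.ml.fpm import FPGrowth

enrollments = spark.createDataFrame([
    (0, ["algebra", "python", "statistics"]),
    (1, ["python", "statistics"]),
    (2, ["algebra", "statistics"]),
], ["student", "courses"])

fp = FPGrowth(itemsCol="courses", minSupport=0.3, minConfidence=0.6)
model = fp.fit(enrollments)
model.freqItemsets.show()      # frequent course sets
model.associationRules.show()  # rules used to suggest the next course
```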

46 citations


Posted ContentDOI
21 Oct 2019-bioRxiv
TL;DR: SPARK is up to ten times more powerful than existing approaches; this high power allows the identification of new genes and pathways that reveal new biology in the data that otherwise could not be uncovered by existing approaches.
Abstract: Recent development of various spatially resolved transcriptomic techniques has enabled gene expression profiling on complex tissues with spatial localization information. Identifying genes that display spatial expression patterns in these studies is an important first step towards characterizing the spatial transcriptomic landscape. Detecting spatially expressed genes requires the development of statistical methods that can properly model spatial count data, provide effective type I error control, have sufficient statistical power, and are computationally efficient. Here, we developed such a method, SPARK. SPARK directly models count data generated from various spatially resolved transcriptomic techniques through generalized linear spatial models. With a new efficient penalized quasi-likelihood based algorithm, SPARK is scalable to data sets with tens of thousands of genes measured on tens of thousands of samples. Importantly, SPARK relies on newly developed statistical formulas for hypothesis testing, producing well-calibrated p-values and yielding high statistical power. We illustrate the benefits of SPARK through extensive simulations and in-depth analysis of four published spatially resolved transcriptomic data sets. In the real data applications, SPARK is up to ten times more powerful than existing approaches. The high power of SPARK allows us to identify new genes and pathways that reveal new biology in the data that otherwise cannot be revealed by existing approaches.
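
In generic form, the class of generalized linear spatial model SPARK fits can be written as follows (our rendering as a Poisson model with a variance component; the paper's exact kernels, normalization, and overdispersion handling may differ):

```latex
y_i \mid \lambda_i \sim \mathrm{Poisson}(N_i \lambda_i), \qquad
\log \lambda_i = \mathbf{x}_i^{\top} \boldsymbol{\beta} + b_i, \qquad
\mathbf{b} \sim \mathcal{N}\bigl(\mathbf{0},\ \tau K(\mathbf{s})\bigr)
```

Here y_i is a gene's count at spatial location s_i, N_i is a normalization factor, and K is a spatial covariance kernel; a gene displays a spatial expression pattern exactly when the variance component is non-zero, so the hypothesis test reduces to H0: tau = 0.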

Journal ArticleDOI
TL;DR: The DENCAST system is proposed, a novel distributed algorithm implemented in Apache Spark, which performs density-based clustering and exploits the identified clusters to solve both single- and multi-target regression tasks (and thus, solves complex tasks such as time series prediction).
Abstract: Recent developments in sensor networks and mobile computing led to a huge increase in data generated that need to be processed and analyzed efficiently. In this context, many distributed data mining algorithms have recently been proposed. Following this line of research, we propose the DENCAST system, a novel distributed algorithm implemented in Apache Spark, which performs density-based clustering and exploits the identified clusters to solve both single- and multi-target regression tasks (and thus, solves complex tasks such as time series prediction). Contrary to existing distributed methods, DENCAST does not require a final merging step (usually performed on a single machine) and is able to handle large-scale, high-dimensional data by taking advantage of locality sensitive hashing. Experiments show that DENCAST performs clustering more efficiently than a state-of-the-art distributed clustering algorithm, especially when the number of objects increases significantly. The quality of the extracted clusters is confirmed by the predictive capabilities of DENCAST on several datasets: It is able to significantly outperform (p-value < 0.05) state-of-the-art distributed regression methods, in both single and multi-target settings.
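
DENCAST's clustering logic is not reproduced here; the sketch shows only the locality-sensitive-hashing neighbor join it builds on, using Spark ML's random-projection LSH (parameters and data are illustrative, and a SparkSession `spark` is assumed):

```python
from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame(
    [(0, Vectors.dense([1.0, 1.0])),
     (1, Vectors.dense([1.0, 1.2])),
     (2, Vectors.dense([9.0, 9.0]))], ["id", "features"])

lsh = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes",
                                  bucketLength=2.0, numHashTables=3)
model = lsh.fit(df)
# Approximate neighbor pairs within distance 1.0: the primitive from which a
# distributed density-based step can assemble its neighborhood graph.
model.approxSimilarityJoin(df, df, 1.0, distCol="dist") \
     .where("datasetA.id < datasetB.id").show()
```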

Posted ContentDOI
16 Nov 2019-bioRxiv
TL;DR: SparK is presented, a tool which auto-generates publication-ready, high-resolution, true vector graphic figures from any NGS-based tracks, including RNA-seq, ChIP-seq, and ATAC-seq, and is written in Python 3, making it executable on any major OS platform.
Abstract: While there are sophisticated resources available for displaying NGS data, including the Integrative Genomics Viewer (IGV) and the UCSC genome browser, exporting regions and assembling figures for publication remains challenging. In particular, customizing track appearance and overlaying track replicates is a manual and time-consuming process. Here, we present SparK, a tool which auto-generates publication-ready, high-resolution, true vector graphic figures from any NGS-based tracks, including RNA-seq, ChIP-seq, and ATAC-seq. Novel functions of SparK include averaging of replicates, plotting standard deviation tracks, and highlighting significantly changed areas. SparK is written in Python 3, making it executable on any major OS platform. Using command line prompts to generate figures makes later changes very easy to apply. For instance, if the genomic region of the plot needs to be changed, or tracks need to be added or removed, the figure can easily be re-generated within seconds, without the manual process of re-exporting and re-assembling everything. After plotting with SparK, changes to the output SVG vector graphic files are simple to make, including text, lines, and colors. SparK is publicly available on GitHub: https://github.com/harbourlab/SparK.

Journal ArticleDOI
TL;DR: It is indicated that the Spark-based parallel FCM algorithm provides faster speed of segmentation for agricultural image big data and has better scale-up and size-up rates.
Abstract: With the explosive growth of image big data in the agriculture field, image segmentation algorithms are confronted with unprecedented challenges. As one of the most important image segmentation technologies, the fuzzy c-means (FCM) algorithm has been widely used in the field of agricultural image segmentation, as it provides simple computation and high-quality segmentation. However, due to its large amount of computation, the sequential FCM algorithm is too slow to finish the segmentation task within an acceptable time. This paper proposes a parallel FCM segmentation algorithm based on the distributed in-memory computing platform Apache Spark for agricultural image big data. The input image is first converted from the RGB color space to the Lab color space to generate point cloud data. Then, the point cloud data are partitioned and stored on different computing nodes, on which the membership degrees of pixel points to the different cluster centers are calculated and the cluster centers are updated iteratively in a data-parallel fashion until the stopping condition is satisfied. Finally, the point cloud data are restored after clustering to reconstruct the segmented image. On the Spark platform, the parallel FCM algorithm reaches an average speedup of 12.54 on ten computing nodes. The experimental results show that the Spark-based parallel FCM algorithm obtains a significant increase in speedup, and on the agricultural image testing set it delivers a performance improvement of 128% over the Hadoop-based approach. This paper indicates that the Spark-based parallel FCM algorithm provides faster segmentation of agricultural image big data and has better scale-up and size-up rates.
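
A data-parallel sketch of one FCM iteration with fuzzifier m = 2, mirroring the broadcast-centers / aggregate-statistics pattern the abstract describes (toy data; not the authors' implementation):

```python
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="parallel-fcm-sketch")
pixels = sc.parallelize(np.random.rand(10000, 3)).cache()  # Lab point cloud
centers = np.random.rand(4, 3)                             # 4 initial centers

def point_stats(x, c, m=2.0):
    d = np.linalg.norm(c - x, axis=1) + 1e-12              # distances to centers
    u = 1.0 / ((d[:, None] / d[None, :]) ** (2.0 / (m - 1.0))).sum(axis=1)
    u_m = u ** m
    return (u_m[:, None] * x, u_m)  # numerator/denominator contributions

for _ in range(10):
    cb = sc.broadcast(centers)
    num, den = pixels.map(lambda x: point_stats(x, cb.value)) \
                     .reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    centers = num / den[:, None]    # c_j = sum(u_ij^m x_i) / sum(u_ij^m)
```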

Journal ArticleDOI
01 Aug 2019
TL;DR: This tutorial describes the foundations of different automatic parameter tuning algorithms, presents the pros and cons of each approach, and identifies research challenges for handling cloud services, resource heterogeneity, and real-time analytics.
Abstract: Database and big data analytics systems such as Hadoop and Spark have a large number of configuration parameters that control memory distribution, I/O optimization, parallelism, and compression. Improper parameter settings can cause significant performance degradation and stability issues. However, regular users and even expert administrators struggle to understand and tune them to achieve good performance. In this tutorial, we review existing approaches on automatic parameter tuning for databases, Hadoop, and Spark, which we classify into six categories: rule-based, cost modeling, simulation-based, experiment-driven, machine learning, and adaptive tuning. We describe the foundations of different automatic parameter tuning algorithms and present pros and cons of each approach. We also highlight real-world applications and systems, and identify research challenges for handling cloud services, resource heterogeneity, and real-time analytics.

Journal ArticleDOI
TL;DR: This work develops a frequent itemset mining method using sliding windows, capable of extracting tendencies from continuous data flows, built with Big Data technologies, in particular the Spark Streaming framework, which enables distributing the computation across several clusters and thus improves the algorithm's speed.
Abstract: The amount of information generated in social media channels or economic/business transactions exceeds the usual bounds of static databases and is continuously growing. In this work, we propose a frequent itemset mining method using sliding windows that is capable of extracting tendencies from continuous data flows. To that aim, we develop this method using Big Data technologies, in particular the Spark Streaming framework, which enables distributing the computation across several clusters and thus improves the algorithm's speed. The experimentation carried out shows the capability of our proposal and its scalability when massive amounts of streaming data are taken into account.
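
The itemset logic itself is the paper's contribution and is not reproduced here; the sketch below shows only the sliding-window plumbing in Spark Streaming, using an invertible reduce so window updates stay incremental:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="sliding-window-sketch")
ssc = StreamingContext(sc, batchDuration=10)  # 10 s micro-batches
ssc.checkpoint("/tmp/ckpt")                   # windowed state needs checkpoints

items = ssc.socketTextStream("localhost", 9999).flatMap(lambda l: l.split())
counts = items.map(lambda i: (i, 1)).reduceByKeyAndWindow(
    lambda a, b: a + b,   # counts entering the window
    lambda a, b: a - b,   # counts leaving it
    windowDuration=60, slideDuration=10)
counts.pprint()

ssc.start()
ssc.awaitTermination()
```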

Journal ArticleDOI
TL;DR: A shared nearest-neighbor quantum game-based attribute reduction (SNNQGAR) algorithm that incorporates the hierarchical coevolutionary Spark model that can be successfully applied to segment overlapping and interdependent fuzzy cerebral tissues, and it exhibits a stable and consistent segmentation performance for neonatal cerebral cortical surfaces.
Abstract: The unprecedented increase in data volume has become a severe challenge for conventional patterns of data mining and learning systems tasked with handling big data. The recently introduced Spark platform is a new processing method for big data analysis and related learning systems, which has attracted increasing attention from both the scientific community and industry. In this paper, we propose a shared nearest-neighbor quantum game-based attribute reduction (SNNQGAR) algorithm that incorporates the hierarchical coevolutionary Spark model. We first present a shared coevolutionary nearest-neighbor hierarchy with self-evolving compensation that considers the features of nearest-neighborhood attribute subsets and calculates the similarity between attribute subsets according to the shared neighbor information of attribute sample points. We then present a novel attribute weight tensor model to generate ranking vectors of attributes and apply them to balance the relative contributions of different neighborhood attribute subsets. To optimize the model, we propose an embedded quantum equilibrium game paradigm (QEGP) to ensure that noisy attributes do not degrade the big data reduction results. A combination of the hierarchical coevolutionary Spark model and an improved MapReduce framework is then constructed so that it can better parallelize the SNNQGAR to efficiently determine the preferred reduction solutions of the distributed attribute subsets. The experimental comparisons demonstrate the superior performance of the SNNQGAR, which outperforms most of the state-of-the-art attribute reduction algorithms. Moreover, the results indicate that the SNNQGAR can be successfully applied to segment overlapping and interdependent fuzzy cerebral tissues, and it exhibits a stable and consistent segmentation performance for neonatal cerebral cortical surfaces.

Journal ArticleDOI
TL;DR: DiCFS is described as a completely redesigned, scalable, parallel and distributed version of the CFS algorithm, capable of dealing with the large volumes of data typical of big data applications, and able to handle larger datasets than the non-distributed WEKA version.


Journal ArticleDOI
TL;DR: A total of 50 different liquid fuel compounds currently discussed as alternative fuels for spark-ignition engines were identified from the literature and assessed using a thermodynamic engine model.
Abstract: The currently discussed alternative fuels for spark-ignition engines are numerous. A total of 50 different liquid fuel compounds were identified from the literature. Using a thermodynamic engine mo...

Journal ArticleDOI
TL;DR: This work proposes an innovative tool, named LADRA, for log-based abnormal task detection and root-cause analysis using Spark logs, employing a General Regression Neural Network (GRNN) to identify root causes of abnormal tasks.

Proceedings ArticleDOI
25 Jun 2019
TL;DR: The RaSQL system, which extends Spark SQL with new recursive-aggregate constructs and implementation techniques, matches and often surpasses the performance of other systems, including Apache Giraph, GraphX and Myria.
Abstract: Thanks to a simple SQL extension, Recursive-aggregate-SQL (RaSQL) can express very powerful queries and declarative algorithms, such as classical graph algorithms and data mining algorithms. A novel compiler implementation allows RaSQL to map declarative queries into one basic fixpoint operator supporting aggregates in recursive queries. A fully optimized implementation of this fixpoint operator leads to superior performance, scalability and portability. Thus, our RaSQL system, which extends Spark SQL with the aforementioned new constructs and implementation techniques, matches and often surpasses the performance of other systems, including Apache Giraph, GraphX and Myria.
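
RaSQL's syntax is not reproduced here; the PySpark loop below hand-writes the kind of fixpoint its compiler emits for a shortest-path query over a toy graph, iterating join plus min-aggregate until nothing changes:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("fixpoint-sketch").getOrCreate()
edges = spark.createDataFrame(
    [(1, 2, 1.0), (2, 3, 2.0), (1, 3, 5.0)], ["src", "dst", "w"])
paths = spark.createDataFrame([(1, 0.0)], ["dst", "dist"])  # source node 1

while True:
    step = (paths.join(edges, paths["dst"] == edges["src"])
                 .select(edges["dst"].alias("dst"),
                         (paths["dist"] + edges["w"]).alias("dist")))
    new = paths.union(step).groupBy("dst").agg(F.min("dist").alias("dist"))
    if new.subtract(paths).count() == 0:  # fixpoint: no shorter path found
        break
    paths = new
paths.show()
```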

Journal ArticleDOI
TL;DR: In this paper, a numerical model of a real micro-scale CHP plant, the ECO20 manufactured by the Italian company Costruzioni Motori Diesel S.p.A., is presented, coupled with an optimization algorithm searching for the best performance in terms of electric power output.

Journal ArticleDOI
TL;DR: This paper proposes the Spark-based meta-predictor Spark-IDPP, which enables efficient prediction of disordered regions of proteins on a large scale, and proves that, through appropriate partitioning of data and by increasing the degree of parallelism, the method can significantly improve the efficiency of IDP predictions.
Abstract: Intrinsically disordered proteins (IDPs) constitute a significant part of the proteins that exist and act in the cells of living organisms. IDPs play key roles in central cellular processes, and some of them are closely related to various human diseases, like cancer or neurodegenerative disorders. Identification of IDPs and studying their structural characteristics have become an important part of structural bioinformatics and structural genomics. However, the growing amount of genomic and protein sequences in public repositories puts pressure on existing methods for identification of IDPs. Large volumes of protein amino acid sequences need to be analyzed in terms of their propensity to form disordered regions, and this task requires novel tools and scalable platforms to cope with this big biological data challenge. In this paper, we show how the identification of disordered regions of 3D protein structures can be efficiently accelerated with an Apache Spark cluster established and scaled on the public cloud. For this purpose, we propose the Spark-based meta-predictor Spark-IDPP, which enables efficient prediction of disordered regions of proteins on a large scale. Results of our performance tests show that, for large data sets, our method achieves almost linear speedup when scaling out the computations on the 32-node Spark cluster located in the Azure cloud. This proves that, through appropriate partitioning of data and by increasing the degree of parallelism, we can significantly improve the efficiency of IDP predictions. Additionally, by using several basic predictors, aggregating their ranks in various consensus modes, and filtering the final outcome with a dedicated fuzzy filter, Spark-IDPP increases the quality of predictions.
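
A sketch of the partition-and-predict pattern only; predict_disorder() is a hypothetical stand-in for the basic predictors Spark-IDPP wraps, and one sequence per input line is assumed:

```python
from pyspark import SparkContext

sc = SparkContext(appName="idpp-partition-sketch")

def predict_disorder(seq):
    # Placeholder: a real predictor returns per-residue disorder propensities.
    return [0.5] * len(seq)

seqs = sc.textFile("proteins.txt").repartition(64)  # widen parallelism first
scores = seqs.map(lambda s: (s[:12], predict_disorder(s)))
scores.saveAsTextFile("disorder_out")
```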

Book ChapterDOI
01 Jan 2019
TL;DR: This research focuses on the selection of the ALS algorithm parameters that can affect the performance of building a robust RS, and proposes a movie recommender system based on ALS using Apache Spark.
Abstract: Recently, the building of recommender systems has become a significant research area that attracts scientists and researchers across the world. Recommender systems are used in a variety of areas including music, movies, books, news, search queries, and commercial products. The collaborative filtering (CF) algorithm is one of the popular, successful techniques of RS, which aims to find users closely similar to the active one in order to recommend items. CF with the alternating least squares (ALS) algorithm is one of the most important techniques used for building a movie recommendation engine. The ALS algorithm is a matrix-factorization-based CF model, which factorizes the user-item rating matrix. There is thus a need to analyze the ALS algorithm by selecting the different parameters that can eventually help in building an efficient movie recommender engine. In this paper, we propose a movie recommender system based on ALS using Apache Spark. This research focuses on the selection of the ALS algorithm parameters that can affect the performance of building a robust RS. From the results, a conclusion is drawn on the selection of the ALS parameters that affect the performance of building a movie recommender engine. The model evaluation is done using different metrics, such as execution time, root mean squared error (RMSE) of rating prediction, and the rank at which the best model was trained. Two best cases are chosen, based on the best parameter selection from the experimental results, which can lead to good rating predictions for a movie recommender.
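
A sketch of the ALS training and RMSE evaluation the chapter describes; the toy ratings stand in for a MovieLens-style dataset, a SparkSession `spark` is assumed, and the rank/regParam values are just one grid point of the parameter study:

```python
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

ratings = spark.createDataFrame(
    [(0, 0, 4.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0), (2, 0, 5.0)],
    ["userId", "movieId", "rating"])

als = ALS(rank=10, regParam=0.1, maxIter=10,
          userCol="userId", itemCol="movieId", ratingCol="rating",
          coldStartStrategy="drop")  # drop NaN predictions before evaluating
model = als.fit(ratings)

# Held-out split omitted in this toy example; evaluate on the training data.
rmse = RegressionEvaluator(metricName="rmse", labelCol="rating",
                           predictionCol="prediction") \
    .evaluate(model.transform(ratings))
print("rank=10, regParam=0.1 -> RMSE:", rmse)
```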

Journal ArticleDOI
TL;DR: By proposing a Resilient Distributed Dataset (RDD) localized subclustering method, the disk I/O burden of MapReduce-based clustering approaches is resolved, and a comparison of the clustering results with similar works shows the superiority of the proposed algorithm in precision and cluster validity indexes.

Journal ArticleDOI
TL;DR: The test results show that in the same computing environment and for the same text sets, the Spark PNBA is obviously superior to the Hadoop PNBA in terms of key indicators such as speedup ratio and scalability.
Abstract: The sharp increase of the amount of Internet Chinese text data has significantly prolonged the processing time of classification on these data. In order to solve this problem, this paper proposes and implements a parallel naive Bayes algorithm (PNBA) for Chinese text classification based on Spark, a parallel memory computing platform for big data. This algorithm has implemented parallel operation throughout the entire training and prediction process of naive Bayes classifier mainly by adopting the programming model of resilient distributed datasets (RDD). For comparison, a PNBA based on Hadoop is also implemented. The test results show that in the same computing environment and for the same text sets, the Spark PNBA is obviously superior to the Hadoop PNBA in terms of key indicators such as speedup ratio and scalability. Therefore, Spark-based parallel algorithms can better meet the requirement of large-scale Chinese text data mining.
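
A pipeline sketch of the classifier side (whitespace tokenization stands in for the real Chinese word segmentation the paper's preprocessing would perform; a SparkSession `spark` is assumed):

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.ml.classification import NaiveBayes

docs = spark.createDataFrame(
    [("spark makes big data processing fast", 0.0),
     ("the team won the football match", 1.0)], ["text", "label"])

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    HashingTF(inputCol="words", outputCol="tf", numFeatures=1 << 18),
    IDF(inputCol="tf", outputCol="features"),
    NaiveBayes(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(docs)
model.transform(docs).select("text", "prediction").show()
```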

Journal ArticleDOI
TL;DR: Experimental results demonstrate that the proposed novel distributed recommendation solution based on Apache Spark is able to significantly speed up the distributed training, as well as improve the performance in the context of Big Data.
Abstract: Recommendation systems have been widely deployed to address the challenge of overwhelming information. They are used to enable users to find interesting information from a large volume of data. However, in the era of Big Data, as data become larger and more complicated, a recommendation algorithm that runs in a traditional environment cannot be fast and effective. It requires a high computational cost for performing the training task, which may limit its applicability in real-world Big Data applications. In this paper, we propose a novel distributed recommendation solution for Big Data. It is designed based on Apache Spark to handle large-scale data, improve the prediction quality, and address the data sparsity problem. In particular, thanks to a novel learning process, the model is able to significantly speed up the distributed training, as well as improve the performance in the context of Big Data. Experimental results on three real-world data sets demonstrate that our proposal outperforms existing recommendation methods in terms of Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and computational time.

Journal ArticleDOI
TL;DR: A novel notion of 'Socio-Cyber Network' is derived, in which friendships are made based on users' geo-location information and a graph-theoretic trust index is used, providing a better understanding of extracting knowledge from the data and finding relationships between different users.

Journal ArticleDOI
TL;DR: In this paper, the authors experimentally investigate spark ignition and the subsequent early flame development of lean air-fuel mixtures of A/F under high-velocity flow conditions using a uniqu...
Abstract: This study set out to experimentally investigate spark ignition and the subsequent early flame development of lean air–fuel mixtures of A/F = 20–30 under high-velocity flow conditions using a uniqu...