
Showing papers in "Journal of Big Data in 2021"


Journal ArticleDOI
TL;DR: In this paper, a comprehensive survey of the most important aspects of DL, including recent enhancements to the field, is provided, together with the challenges and suggested solutions to help researchers understand the existing research gaps.
Abstract: In the last few years, the deep learning (DL) computing paradigm has been deemed the Gold Standard in the machine learning (ML) community. Moreover, it has gradually become the most widely used computational approach in the field of ML, achieving outstanding results on several complex cognitive tasks, matching or even beating human performance. One of the benefits of DL is the ability to learn from massive amounts of data. The DL field has grown quickly in the last few years and has been used successfully to address a wide range of traditional applications. More importantly, DL has outperformed well-known ML techniques in many domains, e.g., cybersecurity, natural language processing, bioinformatics, robotics and control, and medical information processing, among many others. Although several works have reviewed the state of the art of DL, each of them tackled only one aspect of it, which leads to an overall lack of knowledge about the field. Therefore, in this contribution, we take a more holistic approach in order to provide a more suitable starting point from which to develop a full understanding of DL. Specifically, this review attempts to provide a comprehensive survey of the most important aspects of DL, including those enhancements recently added to the field. In particular, this paper outlines the importance of DL and presents the types of DL techniques and networks. It then presents convolutional neural networks (CNNs), the most utilized DL network type, and describes the development of CNN architectures together with their main features, starting with the AlexNet network and closing with the High-Resolution network (HR.Net). Finally, we present the challenges and suggested solutions to help researchers understand the existing research gaps, followed by a list of the major DL applications. Computational tools including FPGAs, GPUs, and CPUs are summarized along with a description of their influence on DL. The paper ends with the evolution matrix, benchmark datasets, and a summary and conclusion.

1,084 citations


Journal ArticleDOI
TL;DR: A survey of data augmentation for text data can be found in this article, where the major motifs of Data Augmentation are summarized into strengthening local decision boundaries, brute force training, causality and counterfactual examples, and the distinction between meaning and form.
Abstract: Natural Language Processing (NLP) is one of the most captivating applications of Deep Learning. In this survey, we consider how the Data Augmentation training strategy can aid in its development. We begin with the major motifs of Data Augmentation summarized into strengthening local decision boundaries, brute force training, causality and counterfactual examples, and the distinction between meaning and form. We follow these motifs with a concrete list of augmentation frameworks that have been developed for text data. Deep Learning generally struggles with the measurement of generalization and characterization of overfitting. We highlight studies that cover how augmentations can construct test sets for generalization. NLP is at an early stage in applying Data Augmentation compared to Computer Vision. We highlight the key differences and promising ideas that have yet to be tested in NLP. For the sake of practical implementation, we describe tools that facilitate Data Augmentation such as the use of consistency regularization, controllers, and offline and online augmentation pipelines, to preview a few. Finally, we discuss interesting topics around Data Augmentation in NLP such as task-specific augmentations, the use of prior knowledge in self-supervised learning versus Data Augmentation, intersections with transfer and multi-task learning, and ideas for AI-GAs (AI-Generating Algorithms). We hope this paper inspires further research interest in Text Data Augmentation.

487 citations
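
As a toy illustration of the kind of text-space augmentation the survey discusses (not code from the paper), the sketch below applies two simple operations, random swap and random deletion, to a tokenized sentence. The function names and probabilities are arbitrary choices made for this example.

```python
import random

def random_swap(tokens, n_swaps=1):
    """Swap two randomly chosen tokens n_swaps times."""
    tokens = tokens[:]
    for _ in range(n_swaps):
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p=0.1):
    """Drop each token independently with probability p."""
    kept = [t for t in tokens if random.random() > p]
    return kept or tokens  # never return an empty sentence

sentence = "data augmentation can strengthen local decision boundaries".split()
print(random_swap(sentence))
print(random_deletion(sentence))
```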


Journal ArticleDOI
TL;DR: In this paper, a survey explores how deep learning has been used in combating the COVID-19 pandemic and provides directions for future research, covering deep learning applications in natural language processing, computer vision, life sciences, and epidemiology.
Abstract: This survey explores how Deep Learning has battled the COVID-19 pandemic and provides directions for future research on COVID-19. We cover Deep Learning applications in Natural Language Processing, Computer Vision, Life Sciences, and Epidemiology. We describe how each of these applications vary with the availability of big data and how learning tasks are constructed. We begin by evaluating the current state of Deep Learning and conclude with key limitations of Deep Learning for COVID-19 applications. These limitations include Interpretability, Generalization Metrics, Learning from Limited Labeled Data, and Data Privacy. Natural Language Processing applications include mining COVID-19 research for Information Retrieval and Question Answering, as well as Misinformation Detection, and Public Sentiment Analysis. Computer Vision applications cover Medical Image Analysis, Ambient Intelligence, and Vision-based Robotics. Within Life Sciences, our survey looks at how Deep Learning can be applied to Precision Diagnostics, Protein Structure Prediction, and Drug Repurposing. Deep Learning has additionally been utilized in Spread Forecasting for Epidemiology. Our literature review has found many examples of Deep Learning systems to fight COVID-19. We hope that this survey will help accelerate the use of Deep Learning for COVID-19 research.

139 citations


Posted ContentDOI
TL;DR: This paper aggregates some of the literature on missing data particularly focusing on machine learning techniques, and gives insight on how the machine learning approaches work by highlighting the key features of the proposed techniques, how they perform, their limitations and the kind of data they are most suitable for.
Abstract: Machine learning has been the cornerstone in analysing and extracting information from data, and a problem of missing values is often encountered. Missing values occur because of various factors: missing completely at random, missing at random, or missing not at random. All of these may result from system malfunction during data collection or human error during data pre-processing. Nevertheless, it is important to deal with missing values before analysing data, since ignoring or omitting missing values may result in biased or misinformed analysis. In the literature there have been several proposals for handling missing values. In this paper, we aggregate some of the literature on missing data, particularly focusing on machine learning techniques. We also give insight into how the machine learning approaches work by highlighting the key features of missing value imputation techniques, how they perform, their limitations, and the kind of data they are most suitable for. We propose and evaluate two methods, the k nearest neighbour and an iterative imputation method (missForest) based on the random forest algorithm. Evaluation is performed on the Iris and novel power plant fan data with induced missing values at missingness rates of 5% to 20%. We show that both missForest and the k nearest neighbour can successfully handle missing values and offer some possible future research directions.

138 citations
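
To make the two evaluated imputers concrete, here is a minimal sketch using scikit-learn: KNNImputer stands in for the k nearest neighbour method, and IterativeImputer with a random-forest estimator approximates the missForest idea. The Iris data and the ~10% missingness mask are placeholders, not the paper's exact setup.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import KNNImputer, IterativeImputer
from sklearn.ensemble import RandomForestRegressor

X = load_iris().data.copy()
rng = np.random.default_rng(0)
mask = rng.random(X.shape) < 0.10          # induce ~10% missingness
X_missing = X.copy()
X_missing[mask] = np.nan

# k-nearest-neighbour imputation
X_knn = KNNImputer(n_neighbors=5).fit_transform(X_missing)

# iterative imputation with a random-forest estimator (missForest-style)
forest = RandomForestRegressor(n_estimators=50, random_state=0)
X_rf = IterativeImputer(estimator=forest, max_iter=10,
                        random_state=0).fit_transform(X_missing)

print("kNN RMSE:", np.sqrt(np.mean((X_knn[mask] - X[mask]) ** 2)))
print("missForest-style RMSE:", np.sqrt(np.mean((X_rf[mask] - X[mask]) ** 2)))
```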


Journal ArticleDOI
TL;DR: This project presents a comparative analysis of 3 major image processing algorithms, SSD, Faster R-CNN, and YOLO, evaluating the performance and accuracy of the three algorithms and analysing their strengths and weaknesses.
Abstract: A computer views all kinds of visual media as an array of numerical values. As a consequence of this approach, it requires image processing algorithms to inspect the contents of images. This project compares 3 major image processing algorithms, Single Shot Detection (SSD), Faster Region based Convolutional Neural Networks (Faster R-CNN), and You Only Look Once (YOLO), to find the fastest and most efficient of the three. In this comparative analysis, using the Microsoft COCO (Common Objects in Context) dataset, the performance of these three algorithms is evaluated and their strengths and limitations are analysed based on parameters such as accuracy, precision and F1 score. From the results of the analysis, it can be concluded that the suitability of any of the algorithms over the other two is dictated to a great extent by the use cases they are applied in. In an identical testing environment, YOLO-v3 outperforms SSD and Faster R-CNN, making it the best of the three algorithms.

99 citations
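
As a small illustration of how the reported parameters (precision, recall, F1) can be computed for detections, the sketch below greedily matches predicted and ground-truth boxes by IoU. It is a simplification for illustration only, not the COCO evaluation protocol used in the study.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def detection_scores(predictions, ground_truth, thr=0.5):
    """Greedy IoU matching -> precision, recall, F1 for one image."""
    matched, tp = set(), 0
    for p in predictions:
        best, best_iou = None, 0.0
        for i, g in enumerate(ground_truth):
            if i not in matched and iou(p, g) > best_iou:
                best, best_iou = i, iou(p, g)
        if best is not None and best_iou >= thr:
            matched.add(best)
            tp += 1
    fp = len(predictions) - tp
    fn = len(ground_truth) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

preds = [[10, 10, 50, 50], [60, 60, 100, 100]]
truth = [[12, 12, 48, 52]]
print(detection_scores(preds, truth))
```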


Journal ArticleDOI
TL;DR: In this paper, the authors present a comprehensive review of existing data-efficient methods and systematize them into four categories: using non-supervised algorithms that are by nature more data-efficient, creating artificially more data, transferring knowledge from rich-data domains into poor-data domains, and altering data-hungry algorithms to reduce their dependency upon the amount of samples.
Abstract: The leading approaches in Machine Learning are notoriously data-hungry. Unfortunately, many application domains do not have access to big data because acquiring data involves a process that is expensive or time-consuming. This has triggered a serious debate in both the industrial and academic communities calling for more data-efficient models that harness the power of artificial learners while achieving good results with less training data and in particular less human supervision. In light of this debate, this work investigates the issue of algorithms' data hungriness. First, it surveys the issue from different perspectives. Then, it presents a comprehensive review of existing data-efficient methods and systematizes them into four categories. Specifically, the survey covers solution strategies that handle data-efficiency by (i) using non-supervised algorithms that are, by nature, more data-efficient, by (ii) creating artificially more data, by (iii) transferring knowledge from rich-data domains into poor-data domains, or by (iv) altering data-hungry algorithms to reduce their dependency upon the amount of samples, in a way they can perform well in the small samples regime. Each strategy is extensively reviewed and discussed. In addition, the emphasis is put on how the four strategies interplay with each other in order to motivate exploration of more robust and data-efficient algorithms. Finally, the survey delineates the limitations, discusses research challenges, and suggests future opportunities to advance the research on data-efficiency in machine learning.

65 citations


Journal ArticleDOI
TL;DR: In this article, a deep convolution neural network (CNN) was modified and adapted for person recognition with Image Augmentation (IA) technique depending on gait features, which improved the accuracy of person recognition using gait model comparing to model without adaptation.
Abstract: Person Recognition based on Gait Model (PRGM) and motion features is indeed a challenging and novel task due to its usages and to the critical issues of human pose variation, human body occlusion, camera view variation, etc. In this project, a deep convolution neural network (CNN) was modified and adapted for person recognition with an Image Augmentation (IA) technique depending on gait features. Adaptation aims to get the best values for the CNN parameters and thus the best CNN model. In addition to the adaptation of the CNN parameters, the design of the CNN model itself was adapted to get the best model structure; adaptation of the design affected the type and number of layers in the CNN and the normalization between them. After choosing the best parameters and the best design, image augmentation was used to increase the size of the training dataset with many copies of each image, to boost the number of different images that will be used to train the deep learning algorithm. The tests were performed using a known dataset (the Market dataset), which contains sequential pictures of people in different gait states. In the CNN model, each image, as a matrix, is expanded into many matrices by the convolution, so the dataset size may grow by a hundred times, making the problem a big data issue. In this project, results show that adaptation improved the accuracy of person recognition using the gait model compared to the model without adaptation. In addition, the dataset contains images of persons carrying things. The IA technique made the model robust to some variations such as image dimensions (quality and resolution), rotations, and objects carried by persons. For recognition of 200 persons, validation accuracy was about 82% without IA and 96.23% with IA. For recognition of 800 persons, validation accuracy was 93.62% without IA.

64 citations
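
A rough sketch of the image-augmentation idea applied to a person-identity CNN, using Keras' ImageDataGenerator. The directory name, image size, and network layout are assumptions made for illustration, not the adapted architecture described in the paper.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras import layers, models

# Augmentation pipeline: rescaling, rotations, shifts, zoom, and flips
augmenter = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.2,
    horizontal_flip=True,
)

train_flow = augmenter.flow_from_directory(
    "market_train/",            # hypothetical directory of per-person image folders
    target_size=(128, 64),
    batch_size=32,
    class_mode="categorical",
)

# A small CNN classifier over person identities (illustrative layout only)
model = models.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(128, 64, 3)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(train_flow.num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train_flow, epochs=10)
```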


Journal ArticleDOI
TL;DR: In this paper, resampling is used to adjust the ratio between the different classes, making the data more balanced and improving the performance of artificial neural network multi-class classifiers.
Abstract: Machine learning plays an increasingly significant role in the building of Network Intrusion Detection Systems. However, machine learning models trained with imbalanced cybersecurity data cannot recognize minority data, hence attacks, effectively. One way to address this issue is to use resampling, which adjusts the ratio between the different classes, making the data more balanced. This research looks at resampling's influence on the performance of Artificial Neural Network multi-class classifiers. The resampling methods, random undersampling, random oversampling, random undersampling and random oversampling, random undersampling with Synthetic Minority Oversampling Technique, and random undersampling with Adaptive Synthetic Sampling Method, were used on benchmark cybersecurity datasets, KDD99, UNSW-NB15, UNSW-NB17 and UNSW-NB18. Macro precision, macro recall, and macro F1-score were used to evaluate the results. The patterns found were: first, oversampling increases the training time and undersampling decreases the training time; second, if the data is extremely imbalanced, both oversampling and undersampling increase recall significantly; third, if the data is not extremely imbalanced, resampling will not have much of an impact; fourth, with resampling, mostly oversampling, more of the minority data (attacks) were detected.

62 citations
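
The resampling methods named above map directly onto imbalanced-learn classes. The sketch below applies them to a synthetic imbalanced dataset to show the class counts before and after resampling; the dataset and class weights are placeholders, not the benchmark cybersecurity data.

```python
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN
from sklearn.datasets import make_classification

# A synthetic, heavily imbalanced stand-in for an intrusion-detection dataset
X, y = make_classification(n_samples=20000, n_classes=3, n_informative=6,
                           weights=[0.95, 0.04, 0.01], random_state=0)
print("original:", Counter(y))

for name, sampler in [
    ("random undersampling", RandomUnderSampler(random_state=0)),
    ("random oversampling", RandomOverSampler(random_state=0)),
    ("SMOTE", SMOTE(random_state=0)),
    ("ADASYN", ADASYN(random_state=0)),
]:
    X_res, y_res = sampler.fit_resample(X, y)
    print(name, Counter(y_res))
```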


Journal ArticleDOI
TL;DR: In this article, the authors examine the most recent developments of GANs based techniques for addressing imbalance problems in image data and propose a taxonomy to summarize GAN-based techniques.
Abstract: Any computer vision application development starts off by acquiring images and data, then preprocessing and pattern recognition steps to perform a task. When the acquired images are highly imbalanced and not adequate, the desired task may not be achievable. Unfortunately, the occurrence of imbalance problems in acquired image datasets in certain complex real-world problems such as anomaly detection, emotion recognition, medical image analysis, fraud detection, metallic surface defect detection, disaster prediction, etc., is inevitable. The performance of computer vision algorithms can significantly deteriorate when the training dataset is imbalanced. In recent years, Generative Adversarial Neural Networks (GANs) have gained immense attention from researchers across a variety of application domains due to their capability to model complex real-world image data. Importantly, GANs can not only be used to generate synthetic images; their adversarial learning idea has also shown good potential for restoring balance in imbalanced datasets. In this paper, we examine the most recent developments of GAN-based techniques for addressing imbalance problems in image data. The real-world challenges and implementations of synthetic image generation based on GANs are extensively covered in this survey. Our survey first introduces various imbalance problems in computer vision tasks and their existing solutions, and then examines key concepts such as deep generative image models and GANs. After that, we propose a taxonomy to summarize GAN-based techniques for addressing imbalance problems in computer vision tasks into three major categories: (1) image-level imbalances in classification, (2) object-level imbalances in object detection, and (3) pixel-level imbalances in segmentation tasks. We elaborate the imbalance problems of each group and provide GAN-based solutions for each group. Readers will understand how GAN-based techniques can handle the problem of imbalances and boost the performance of computer vision algorithms.

61 citations


Journal ArticleDOI
TL;DR: In this paper, the authors propose a genetic algorithm based on community detection, which functions in three steps: feature similarities are calculated in the first step; features are classified by community detection algorithms into clusters in the second step; and in the third step, features are picked by a genetic algorithm with a new community-based repair operation.
Abstract: Feature selection is an essential data preprocessing stage in data mining. The core principle of feature selection is to pick a subset of possible features by excluding features with almost no predictive information as well as highly associated redundant features. In the past several years, a variety of meta-heuristic methods were introduced to eliminate redundant and irrelevant features as much as possible from high-dimensional datasets. Among the main disadvantages of present meta-heuristic based approaches is that they often neglect the correlation between a set of selected features. In this article, for the purpose of feature selection, the authors propose a genetic algorithm based on community detection, which functions in three steps. The feature similarities are calculated in the first step. The features are classified by community detection algorithms into clusters throughout the second step. In the third step, features are picked by a genetic algorithm with a new community-based repair operation. Nine benchmark classification problems were analyzed in terms of the performance of the presented approach. Also, the authors have compared the efficiency of the proposed approach with the findings from four available algorithms for feature selection. Comparing the performance of the proposed method with three new feature selection methods based on PSO, ACO, and ABC algorithms on three classifiers showed that the accuracy of the proposed method is on average 0.52% higher than the PSO, 1.20% higher than the ACO, and 1.57% higher than the ABC algorithm.

58 citations


Journal ArticleDOI
TL;DR: In this paper, a data science model for stock price forecasting on the Indonesian exchange was proposed, based on statistical computing with the R language and Long Short-Term Memory (LSTM).
Abstract: The stock market process is full of uncertainty; hence stock price forecasting is very important in finance and business. For stockbrokers, understanding trends, supported by prediction software, is very important for decision making. This paper proposes a data science model for stock price forecasting on the Indonesian exchange based on statistical computing with the R language and Long Short-Term Memory (LSTM). The first COVID-19 (Coronavirus disease 2019) confirmed case in Indonesia was on 2 March 2020. After that, the composite stock price index plunged 28% from the start of the year, and the share prices of cigarette producers and banks in the midst of the corona pandemic reached their lowest values on March 24, 2020. We use big data from Bank Central Asia (BCA) and Bank Mandiri of Indonesia obtained from Yahoo Finance. In our experiments, we visualize the data using data science and predict and simulate the important prices called Open, High, Low and Close (OHLC) with various parameters. Based on the experiments, data science is very useful for data visualization, and our proposed method using Long Short-Term Memory (LSTM) can be used as a predictor on short-term data, with an accuracy of 94.57% obtained from the short-term (1 year) training data with a high number of epochs, rather than from 3 years of training data.
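
A minimal sketch of an LSTM forecaster over OHLC windows in Keras. The CSV file name, the 30-day window, and the network size are assumptions for illustration; the paper's R-based data science pipeline is not reproduced here.

```python
import numpy as np
import pandas as pd
from tensorflow.keras import layers, models

# Hypothetical CSV of daily prices exported from Yahoo Finance (column names assumed)
prices = pd.read_csv("BBCA.JK.csv")[["Open", "High", "Low", "Close"]].values
prices = (prices - prices.min(0)) / (prices.max(0) - prices.min(0))  # min-max scaling

# Sliding windows: 30 past days of OHLC -> next-day OHLC
window = 30
X = np.array([prices[i:i + window] for i in range(len(prices) - window)])
y = prices[window:]

model = models.Sequential([
    layers.LSTM(64, input_shape=(window, 4)),
    layers.Dense(4),                          # predict Open, High, Low, Close
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=50, batch_size=32, validation_split=0.1)
```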

Journal ArticleDOI
TL;DR: In this article, the authors provide a systematic review of research works that are relevant to AI assurance, between years 1985 and 2021, and aim to provide a structured alternative to the landscape.
Abstract: Artificial Intelligence (AI) algorithms are increasingly providing decision making and operational support across multiple domains. AI includes a wide (and growing) library of algorithms that could be applied for different problems. One important notion for the adoption of AI algorithms into operational decision processes is the concept of assurance. The literature on assurance, unfortunately, conceals its outcomes within a tangled landscape of conflicting approaches, driven by contradicting motivations, assumptions, and intuitions. Accordingly, albeit a rising and novel area, this manuscript provides a systematic review of research works that are relevant to AI assurance, between years 1985 and 2021, and aims to provide a structured alternative to the landscape. A new AI assurance definition is adopted and presented, and assurance methods are contrasted and tabulated. Additionally, a ten-metric scoring system is developed and introduced to evaluate and compare existing methods. Lastly, in this manuscript, we provide foundational insights, discussions, future directions, a roadmap, and applicable recommendations for the development and deployment of AI assurance.

Journal ArticleDOI
TL;DR: One-class classification (OCC) as mentioned in this paper is an approach to detect abnormal data points compared to the instances of the known class and can serve to address issues related to severely imbalanced datasets, which are especially very common in big data.
Abstract: In severely imbalanced datasets, using traditional binary or multi-class classification typically leads to bias towards the class(es) with the much larger number of instances. Under such conditions, modeling and detecting instances of the minority class is very difficult. One-class classification (OCC) is an approach to detect abnormal data points compared to the instances of the known class and can serve to address issues related to severely imbalanced datasets, which are especially very common in big data. We present a detailed survey of OCC-related literature works published over the last decade, approximately. We group the different works into three categories: outlier detection, novelty detection, and deep learning and OCC. We closely examine and evaluate selected works on OCC such that a good cross section of approaches, methods, and application domains is represented in the survey. Commonly used techniques in OCC for outlier detection and for novelty detection, respectively, are discussed. We observed one area that has been largely omitted in OCC-related literature is its application context for big data and its inherently associated problems, such as severe class imbalance, class rarity, noisy data, feature selection, and data reduction. We feel the survey will be appreciated by researchers working in these areas of big data.
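
For readers new to OCC, the sketch below shows the basic pattern with two common scikit-learn one-class learners: train only on the known class, then score unseen points as inliers or outliers. The data and hyperparameters are illustrative only, not drawn from the survey.

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(1000, 5))       # the known (majority) class only
anomalies = rng.normal(5, 1, size=(10, 5))      # rare instances never seen in training

# Fit on the known class alone, then score unseen data
ocsvm = OneClassSVM(nu=0.05, kernel="rbf").fit(normal)
iforest = IsolationForest(contamination=0.01, random_state=0).fit(normal)

test = np.vstack([normal[:50], anomalies])
print("One-Class SVM:", ocsvm.predict(test))     # +1 = inlier, -1 = outlier
print("Isolation Forest:", iforest.predict(test))
```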

Journal ArticleDOI
TL;DR: In this paper, the authors propose an ensemble approach based on the bagging technique to improve feature selection stability in medical datasets via data variance reduction, evaluated on four microarray datasets with five well-known feature selection algorithms.
Abstract: In the medical field, distinguishing genes that are relevant to a specific disease, let's say colon cancer, is crucial to finding a cure and understanding its causes and subsequent complications. Usually, medical datasets are comprised of immensely complex dimensions with considerably small sample sizes. Thus, for domain experts, such as biologists, the task of identifying these genes has become a very challenging one, to say the least. Feature selection is a technique that aims to select these genes, or features in the machine learning field, with respect to the disease. However, learning from a medical dataset to identify relevant features suffers from the curse of dimensionality. Due to a large number of features with a small sample size, the selection usually returns a different subset each time a new sample is introduced into the dataset. This selection instability is intrinsically related to data variance. We assume that reducing data variance improves selection stability. In this paper, we propose an ensemble approach based on the bagging technique to improve feature selection stability in medical datasets via data variance reduction. We conducted an experiment using four microarray datasets, each of which suffers from high dimensionality and relatively small sample size. On each dataset, we applied five well-known feature selection algorithms to select varying numbers of features. The proposed technique shows a significant improvement in selection stability while at least maintaining the classification accuracy. The stability improvement ranges from 20 to 50 percent in all cases, implying that the likelihood of selecting the same features increased by 20 to 50 percent. This is accompanied by an increase in classification accuracy in most cases, which reinforces the stated stability results.
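
One plausible way to realize the described bagging idea is sketched below: run a base feature selector on bootstrap samples, aggregate feature votes, and measure the overlap against a single run as a simple stability proxy. The selector, the number of bootstraps, and the Jaccard measure are assumptions, not the paper's exact protocol.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.utils import resample

# High-dimensional, small-sample stand-in for a microarray dataset
X, y = make_classification(n_samples=60, n_features=2000, n_informative=20,
                           random_state=0)

def select_top_k(X, y, k=50):
    selector = SelectKBest(f_classif, k=k).fit(X, y)
    return set(np.flatnonzero(selector.get_support()))

# Bagging: run the selector on B bootstrap samples and keep the most frequent features
B, k = 30, 50
votes = np.zeros(X.shape[1])
for b in range(B):
    Xb, yb = resample(X, y, random_state=b)
    for f in select_top_k(Xb, yb, k):
        votes[f] += 1
aggregated = set(np.argsort(votes)[-k:])

# Stability proxy: Jaccard overlap with a selection from one perturbed dataset
single_run = select_top_k(*resample(X, y, random_state=99), k)
print("overlap with a single bootstrap run:",
      len(aggregated & single_run) / len(aggregated | single_run))
```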

Journal ArticleDOI
TL;DR: In this paper, the authors proposed a multi-source information-fusion stock price prediction framework based on a hybrid deep neural network architecture (Convolution Neural Networks (CNN) and Long Short-Term Memory (LSTM) for market analysis.
Abstract: The stock market is very unstable and volatile due to several factors such as public sentiment, economic factors, and more. Several petabytes of data are generated every second from different sources, which affect the stock market. A fair and efficient fusion of these data sources (factors) into intelligence is expected to offer better prediction accuracy on the stock market. However, integrating these factors from different data sources as one dataset for market analysis is seen as challenging because they come in different formats (numerical or text). In this study, we propose a novel multi-source information-fusion stock price prediction framework based on a hybrid deep neural network architecture (Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM)) named IKN-ConvLSTM. Precisely, we design a predictive framework to integrate stock-related information from six (6) heterogeneous sources. Secondly, we construct a base model using CNN and a random search algorithm as a feature selector to optimise our initial training parameters. Finally, a stacked LSTM network is fine-tuned by using the tuned parameters (features) from the base model to enhance prediction accuracy. Our approach's empirical evaluation was carried out with stock data (January 3, 2017, to January 31, 2020) from the Ghana Stock Exchange (GSE). The results show a good prediction accuracy of 98.31%, specificity (0.9975), sensitivity (0.8939) and F-score (0.9672) on the amalgamated dataset compared with the distinct datasets. Based on the study outcome, it can be concluded that efficient information fusion of different stock price indicators as a single data source for market prediction offers higher prediction accuracy than individual data sources.

Journal ArticleDOI
TL;DR: In this article, the authors implemented deep learning solutions for detecting attacks based on Long Short-Term Memory (LSTM) and used PCA (principal component analysis) and Mutual Information (MI) techniques for dimensionality reduction and feature selection techniques.
Abstract: An intrusion detection system (IDS) is a device or software application that monitors a network for malicious activity or policy violations. It scans a network or a system for a harmful activity or security breaching. IDS protects networks (Network-based intrusion detection system NIDS) or hosts (Host-based intrusion detection system HIDS), and work by either looking for signatures of known attacks or deviations from normal activity. Deep learning algorithms proved their effectiveness in intrusion detection compared to other machine learning methods. In this paper, we implemented deep learning solutions for detecting attacks based on Long Short-Term Memory (LSTM). PCA (principal component analysis) and Mutual information (MI) are used as dimensionality reduction and feature selection techniques. Our approach was tested on a benchmark data set, KDD99, and the experimental outcomes show that models based on PCA achieve the best accuracy for training and testing, in both binary and multiclass classification.
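
A small sketch of the two preprocessing steps mentioned, PCA and mutual-information-based selection, applied to a KDD99 sample with scikit-learn (the dataset is downloaded on first use). The subsample size, the choice of 10 components/features, and the removal of the symbolic columns are simplifications, and the LSTM classifier itself is omitted.

```python
from sklearn.datasets import fetch_kddcup99
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# 10% KDD99 subset; only a slice and only the numeric columns are kept for brevity
X, y = fetch_kddcup99(percent10=True, return_X_y=True)
X, y = X[:20000], LabelEncoder().fit_transform(y[:20000])
numeric = X[:, [0] + list(range(4, X.shape[1]))].astype(float)  # drop 3 symbolic columns
X_scaled = StandardScaler().fit_transform(numeric)

# Dimensionality reduction with PCA
X_pca = PCA(n_components=10).fit_transform(X_scaled)

# Feature selection with mutual information
X_mi = SelectKBest(mutual_info_classif, k=10).fit_transform(X_scaled, y)

print(X_pca.shape, X_mi.shape)   # either matrix could then feed an LSTM-based classifier
```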

Journal ArticleDOI
TL;DR: In this article, the authors used a blockchain network with three channels and used the raft consensus algorithm in designing web interfaces and testing their capabilities to thwart the formation of a block in case of data input errors from the user The server can also do the process as a provider of information and validator for the web interface.
Abstract: Halal Supply Chain Management requires an assurance that the entire process of procurement, distribution, handling, and processing of materials, spare parts, livestock, work-in-process, or finished inventory is well documented and performed fit to the Halal and Toyyib. Blockchain technology is one alternative solution that can improve the Halal Supply Chain, as it can integrate technology for information exchange during the tracking and tracing process in operating and monitoring performance. This technology could improve trust, transparency, and information disclosure between supply chain participants since it could act as a distributed ledger and entitle all transactions to be completely open, yet confidential, immutable, and secured. This study uses a Blockchain Network with three channels and uses the raft consensus algorithm in designing web interfaces and testing their capabilities. From the web interface, there were no failures in the validity test during the invoke test and the query test. In addition, the web interface was also successfully tested to thwart the formation of a block in case of data input errors from the user. The server can also act as a provider of information and validator for the web interface. From the results of simulations conducted on the Blockchain Network that was made, the Blockchain's transaction speed is fast and all transactions are successfully transferred to other peers. Thus, a Permissioned Blockchain is useful for the Halal Supply Chain not just because it can secure transactions from some of the halal issues, but also because the transaction speed and rate of data transfer are very effective.

Journal ArticleDOI
TL;DR: The working of classification-based methods mostly relies on a confidence score, which is calculated by the classifier while making a prediction for the test observation, while some clustering-based methods identify the outliers by not forcing every observation to belong to a label.
Abstract: Detection and removal of outliers in a dataset is a fundamental preprocessing task without which the analysis of the data can be misleading. Furthermore, the existence of anomalies in the data can heavily degrade the performance of machine learning algorithms. In order to detect the anomalies in a dataset in an unsupervised manner, some novel statistical techniques are proposed in this paper. The proposed techniques are based on statistical methods considering data compactness and other properties. The newly proposed ideas are found efficient in terms of performance, ease of implementation, and computational complexity. Furthermore, two proposed techniques presented in this paper use transformation of data to a unidimensional distance space to detect the outliers, so irrespective of the data’s high dimensions, the techniques remain computationally inexpensive and feasible. Comprehensive performance analysis of the proposed anomaly detection schemes is presented in the paper, and the newly proposed schemes are found better than the state-of-the-art methods when tested on several benchmark datasets.
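
The transformation to a unidimensional distance space can be illustrated with a very simple statistic: distance to the centroid, thresholded by a robust cutoff. This is only one possible instantiation of the idea, not the paper's proposed techniques; the median-plus-MAD threshold is an arbitrary choice for the sketch.

```python
import numpy as np

def distance_outliers(X, k=3.0):
    """Map points to a one-dimensional distance space (distance to the centroid)
    and flag points whose distance exceeds median + k * MAD."""
    centroid = X.mean(axis=0)
    d = np.linalg.norm(X - centroid, axis=1)          # unidimensional distance space
    mad = np.median(np.abs(d - np.median(d)))
    threshold = np.median(d) + k * mad
    return d > threshold

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(500, 20)),      # compact inliers
               rng.normal(8, 1, size=(5, 20))])       # a few anomalies
print(np.flatnonzero(distance_outliers(X)))
```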

Journal ArticleDOI
TL;DR: In this article, the authors propose a new QA system for translating natural language questions into SPARQL queries, which breaks up the translation process into five smaller, more manageable sub-tasks and uses ensemble machine learning methods as well as Tree-LSTM-based neural network models.
Abstract: Knowledge graphs are a powerful concept for querying large amounts of data. These knowledge graphs are typically enormous and are often not easily accessible to end-users because they require specialized knowledge in query languages such as SPARQL. Moreover, end-users need a deep understanding of the structure of the underlying data models, often based on the Resource Description Framework (RDF). This drawback has led to the development of Question-Answering (QA) systems that enable end-users to express their information needs in natural language. While existing systems simplify user access, there is still room for improvement in the accuracy of these systems. In this paper, we propose a new QA system for translating natural language questions into SPARQL queries. The key idea is to break up the translation process into five smaller, more manageable sub-tasks and use ensemble machine learning methods as well as Tree-LSTM-based neural network models to automatically learn and translate a natural language question into a SPARQL query. The performance of our proposed QA system is empirically evaluated using two renowned benchmarks: the 7th Question Answering over Linked Data Challenge (QALD-7) and the Large-Scale Complex Question Answering Dataset (LC-QuAD). Experimental results show that our QA system outperforms the state-of-the-art systems by 15% on the QALD-7 dataset and by 48% on the LC-QuAD dataset, respectively. In addition, we make our source code available.
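
To show what the translation target looks like, here is a hypothetical natural-language question together with a hand-written SPARQL query for it, executed against the public DBpedia endpoint via SPARQLWrapper. The question, query, and endpoint are illustrative and unrelated to the benchmark datasets used in the paper.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Natural-language question: "Who wrote The Hobbit?"
# A QA system like the one described would translate it into a SPARQL query such as:
query = """
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?author WHERE {
  dbr:The_Hobbit dbo:author ?author .
}
"""

endpoint = SPARQLWrapper("https://dbpedia.org/sparql")
endpoint.setQuery(query)
endpoint.setReturnFormat(JSON)
results = endpoint.query().convert()

for row in results["results"]["bindings"]:
    print(row["author"]["value"])
```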

Journal ArticleDOI
TL;DR: In this paper, a deep learning model was used to detect defective water meter devices with a prediction accuracy in the range of 87-90% even in the presence of categorical descriptors.
Abstract: Deep learning models are tools for data analysis suitable for approximating (non-linear) relationships among variables for the best prediction of an outcome. While these models can be used to answer many important questions, their utility is still harshly criticized, being extremely challenging to identify which data descriptors are the most adequate to represent a given specific phenomenon of interest. With a recent experience in the development of a deep learning model designed to detect failures in mechanical water meter devices, we have learnt that a sensible deterioration of the prediction accuracy can occur if one tries to train a deep learning model by adding specific device descriptors, based on categorical data. This can happen because of an excessive increase in the dimensions of the data, with a correspondent loss of statistical significance. After several unsuccessful experiments conducted with alternative methodologies that either permit to reduce the data space dimensionality or employ more traditional machine learning algorithms, we changed the training strategy, reconsidering that categorical data, in the light of a Pareto analysis. In essence, we used those categorical descriptors, not as an input on which to train our deep learning model, but as a tool to give a new shape to the dataset, based on the Pareto rule. With this data adjustment, we trained a more performative deep learning model able to detect defective water meter devices with a prediction accuracy in the range 87–90%, even in the presence of categorical descriptors.

Journal ArticleDOI
TL;DR: In this article, the authors explored classification performance in detecting web attacks in the recent CSE-CIC-IDS2018 dataset, considering a total of eight random undersampling (RUS) ratios: no sampling, 999:1, 99:1, 95:5, 9:1, 3:1, 65:35, and 1:1.
Abstract: Class imbalance is an important consideration for cybersecurity and machine learning. We explore classification performance in detecting web attacks in the recent CSE-CIC-IDS2018 dataset. This study considers a total of eight random undersampling (RUS) ratios: no sampling, 999:1, 99:1, 95:5, 9:1, 3:1, 65:35, and 1:1. Additionally, seven different classifiers are employed: Decision Tree (DT), Random Forest (RF), CatBoost (CB), LightGBM (LGB), XGBoost (XGB), Naive Bayes (NB), and Logistic Regression (LR). For classification performance metrics, Area Under the Receiver Operating Characteristic Curve (AUC) and Area Under the Precision-Recall Curve (AUPRC) are both utilized to answer the following three research questions. The first question asks: “Are various random undersampling ratios statistically different from each other in detecting web attacks?” The second question asks: “Are different classifiers statistically different from each other in detecting web attacks?” And, our third question asks: “Is the interaction between different classifiers and random undersampling ratios significant for detecting web attacks?” Based on our experiments, the answers to all three research questions is “Yes”. To the best of our knowledge, we are the first to apply random undersampling techniques to web attacks from the CSE-CIC-IDS2018 dataset while exploring various sampling ratios.
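
A compact sketch of sweeping undersampling ratios and scoring with AUC and AUPRC on a synthetic, highly imbalanced dataset. The ratios mirror the ones listed above, but the data, the single classifier, and any resulting numbers are placeholders rather than the CSE-CIC-IDS2018 experiments.

```python
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an extremely imbalanced web-attack dataset
X, y = make_classification(n_samples=100000, weights=[0.9995, 0.0005],
                           flip_y=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# sampling_strategy is the minority:majority ratio after undersampling
for ratio in [None, 1/999, 1/99, 5/95, 1/9, 1/3, 35/65, 1.0]:
    if ratio is None:
        X_res, y_res = X_tr, y_tr                      # "no sampling"
    else:
        rus = RandomUnderSampler(sampling_strategy=ratio, random_state=0)
        X_res, y_res = rus.fit_resample(X_tr, y_tr)
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_res, y_res)
    scores = clf.predict_proba(X_te)[:, 1]
    print(f"ratio={ratio}: AUC={roc_auc_score(y_te, scores):.3f}, "
          f"AUPRC={average_precision_score(y_te, scores):.3f}")
```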

Journal ArticleDOI
TL;DR: The proposed method in this paper consists of initial clustering of all users and assignment of new users to appropriate clusters, assigning appropriate weights to users' characteristics, and identifying a new user's adjacent users using hybrid similarity criteria and an adjacency matrix.
Abstract: Over the past decade, recommendation systems have been one of the most sought after by various researchers. Basket analysis of online systems’ customers and recommending attractive products (movies) to them is very important. Providing an attractive and favorite movie to the customer will increase the sales rate and ultimately improve the system. Various methods have been proposed so far to analyze customer baskets and offer entertaining movies but each of the proposed methods has challenges, such as lack of accuracy and high error of recommendations. In this paper, a link prediction-based method is used to meet the challenges of other methods. The proposed method in this paper consists of four phases: (1) Running the CBRS that in this phase, all users are clustered using Density-based spatial clustering of applications with noise algorithm (DBScan), and classification of new users using Deep Neural Network (DNN) algorithm. (2) Collaborative Recommender System (CRS) Based on Hybrid Similarity Criterion through which similarities are calculated based on a threshold (lambda) between the new user and the users in the selected category. Similarity criteria are determined based on age, gender, and occupation. The collaborative recommender system extracts users who are the most similar to the new user. Then, the higher-rated movie services are suggested to the new user based on the adjacency matrix. (3) Running improved Friendlink algorithm on the dataset to calculate the similarity between users who are connected through the link. (4) This phase is related to the combination of collaborative recommender system’s output and improved Friendlink algorithm. The results show that the Mean Squared Error (MSE) of the proposed model has decreased respectively 8.59%, 8.67%, 8.45% and 8.15% compared to the basic models such as Naive Bayes, multi-attribute decision tree and randomized algorithm. In addition, Mean Absolute Error (MAE) of the proposed method decreased by 4.5% compared to SVD and approximately 4.4% compared to ApproSVD and Root Mean Squared Error (RMSE) of the proposed method decreased by 6.05 % compared to SVD and approximately 6.02 % compared to ApproSVD.

Journal ArticleDOI
TL;DR: In this article, the authors used the CSE-CIC-IDS2018 dataset to investigate ensemble feature selection on the performance of seven classifiers, including Decision Tree (DT), Random Forest (RF), Naive Bayes (NB), Logistic Regression (LR), Catboost, LightGBM, or XGBoost.
Abstract: Machine learning algorithms efficiently trained on intrusion detection datasets can detect network traffic capable of jeopardizing an information system. In this study, we use the CSE-CIC-IDS2018 dataset to investigate ensemble feature selection on the performance of seven classifiers. CSE-CIC-IDS2018 is big data (about 16,000,000 instances), publicly available, modern, and covers a wide range of realistic attack types. Our contribution is centered around answers to three research questions. The first question is, “Does feature selection impact performance of classifiers in terms of Area Under the Receiver Operating Characteristic Curve (AUC) and F1-score?” The second question is, “Does including the Destination_Port categorical feature significantly impact performance of LightGBM and Catboost in terms of AUC and F1-score?” The third question is, “Does the choice of classifier: Decision Tree (DT), Random Forest (RF), Naive Bayes (NB), Logistic Regression (LR), Catboost, LightGBM, or XGBoost, significantly impact performance in terms of AUC and F1-score?” These research questions are all answered in the affirmative and provide valuable, practical information for the development of an efficient intrusion detection model. To the best of our knowledge, we are the first to use an ensemble feature selection technique with the CSE-CIC-IDS2018 dataset.

Journal ArticleDOI
TL;DR: In this article, the authors used a web-scraping algorithm and collected a total of 18,992 property listings in the city of Vilnius during the first wave of the COVID-19 pandemic.
Abstract: As the COVID-19 pandemic came unexpectedly, many real estate experts claimed that the property values would fall like the 2007 crash. However, this study raises the question of what attributes of an apartment are most likely to influence a price revision during the pandemic. The findings in prior studies have lacked consensus, especially regarding the time-on-the-market variable, which exhibits an omnidirectional effect. However, with the rise of Big Data, this study used a web-scraping algorithm and collected a total of 18,992 property listings in the city of Vilnius during the first wave of the COVID-19 pandemic. Afterwards, 15 different machine learning models were applied to forecast apartment revisions, and the SHAP values for interpretability were used. The findings in this study coincide with the previous literature results, affirming that real estate is quite resilient to pandemics, as the price drops were not as dramatic as first believed. Out of the 15 different models tested, extreme gradient boosting was the most accurate, although the difference was negligible. The retrieved SHAP values conclude that the time-on-the-market variable was by far the most dominant and consistent variable for price revision forecasting. Additionally, the time-on-the-market variable exhibited an inverse U-shaped behaviour.
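
The modelling step, gradient boosting plus SHAP attributions, can be sketched as follows. The listing features, the synthetic target, and the XGBoost parameters are invented for illustration and do not reflect the scraped Vilnius data or the paper's results.

```python
import numpy as np
import pandas as pd
import shap
import xgboost as xgb

# Hypothetical listing features; the paper's real data came from scraped Vilnius listings
rng = np.random.default_rng(0)
listings = pd.DataFrame({
    "time_on_market_days": rng.integers(1, 365, 2000),
    "area_sqm": rng.uniform(20, 120, 2000),
    "rooms": rng.integers(1, 5, 2000),
    "floor": rng.integers(1, 20, 2000),
})
# Synthetic target for illustration only
price_revision = -0.02 * listings["time_on_market_days"] + rng.normal(0, 1, 2000)

model = xgb.XGBRegressor(n_estimators=200, max_depth=4).fit(listings, price_revision)

# SHAP values quantify each feature's contribution to individual predictions
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(listings)
print(pd.Series(np.abs(shap_values).mean(axis=0), index=listings.columns)
      .sort_values(ascending=False))
```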

Journal ArticleDOI
TL;DR: A new similarity algorithm is proposed, so-called User Profile Correlation-based Similarity (UPCSim), that examines the genre data and the user profile data, namely age, gender, occupation, and location, and outperforms the previous algorithm on recommendation accuracy.
Abstract: Collaborative filtering is one of the most widely used recommendation system approaches. One issue in collaborative filtering is how to use a similarity algorithm to increase the accuracy of the recommendation system. Most recently, a similarity algorithm that combines the user rating value and the user behavior value has been proposed. The user behavior value is obtained from the user score probability in assessing the genre data. The problem with the algorithm is it only considers genre data for capturing user behavior value. Therefore, this study proposes a new similarity algorithm – so-called User Profile Correlation-based Similarity (UPCSim) – that examines the genre data and the user profile data, namely age, gender, occupation, and location. All the user profile data are used to find the weights of the similarities of user rating value and user behavior value. The weights of both similarities are obtained by calculating the correlation coefficients between the user profile data and the user rating or behavior values. An experiment shows that the UPCSim algorithm outperforms the previous algorithm on recommendation accuracy, reducing MAE by 1.64% and RMSE by 1.4%.

Journal ArticleDOI
TL;DR: The binary classification approach proposed here, which considers label noise instances as anomalies, uniquely uses reconstruction errors on noisy data in order to identify and filter label noise.
Abstract: Label noise is an important data quality issue that negatively impacts machine learning algorithms. For example, label noise has been shown to increase the number of instances required to train effective predictive models. It has also been shown to increase model complexity and decrease model interpretability. In addition, label noise can cause the classification results of a learner to be poor. In this paper, we detect label noise with three unsupervised learners, namely principal component analysis (PCA), independent component analysis (ICA), and autoencoders. We evaluate these three learners on a credit card fraud dataset using multiple noise levels, and then compare results to the traditional Tomek links noise filter. Our binary classification approach, which considers label noise instances as anomalies, uniquely uses reconstruction errors for noisy data in order to identify and filter label noise. For detecting noisy instances, we discovered that the autoencoder algorithm was the top performer (highest recall score of 0.90), while Tomek links performed the worst (highest recall score of 0.62).
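
A minimal sketch of the reconstruction-error idea with a Keras autoencoder: train it on the feature vectors, then flag the instances it reconstructs worst as label-noise candidates. The synthetic data, the 95th-percentile threshold, and the architecture are assumptions, not the paper's configuration.

```python
import numpy as np
from tensorflow.keras import layers, models

# Synthetic stand-in for transaction features with low-dimensional structure
rng = np.random.default_rng(0)
latent = rng.normal(size=(5000, 5))
X = (latent @ rng.normal(size=(5, 30))
     + 0.1 * rng.normal(size=(5000, 30))).astype("float32")

# Train an autoencoder and use reconstruction error as an anomaly score
autoencoder = models.Sequential([
    layers.Dense(16, activation="relu", input_shape=(30,)),
    layers.Dense(8, activation="relu"),      # bottleneck
    layers.Dense(16, activation="relu"),
    layers.Dense(30, activation="linear"),
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=20, batch_size=128, verbose=0)

reconstruction = autoencoder.predict(X, verbose=0)
errors = np.mean((X - reconstruction) ** 2, axis=1)

# Flag the instances with the largest reconstruction errors as likely label noise
threshold = np.percentile(errors, 95)
noisy_candidates = np.flatnonzero(errors > threshold)
print(len(noisy_candidates), "instances flagged for filtering")
```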

Journal ArticleDOI
TL;DR: In this paper, an accurate model for classifying sleep stages by features of Heart Rate Variability (HRV) extracted from Electrocardiogram (ECG) was developed to predict the sleep stages proportion.
Abstract: Recent developments in portable sensor devices, cloud computing, and machine learning algorithms have led to the emergence of big data analytics in healthcare. The condition of the human body, e.g. the ECG signal, can be monitored regularly by means of a portable sensor device. The use of a machine learning algorithm would then provide an overview of a patient's current health on a regular basis, compared to a medical doctor's diagnosis that can only be made during a hospital visit. This work aimed to develop an accurate model for classifying sleep stages by features of Heart Rate Variability (HRV) extracted from the Electrocardiogram (ECG). The sleep stage classification can be utilized to predict the sleep stage proportions, and sleep stage proportion information can provide an insight into human sleep quality. The integration of the Extreme Learning Machine (ELM) and Particle Swarm Optimization (PSO) was utilized for selecting features and determining the number of hidden nodes. The results were compared to the Support Vector Machine (SVM) and plain ELM methods, both of which scored lower than the integration of ELM with PSO. The accuracies of the combined ELM and PSO were 62.66%, 71.52%, 76.77%, and 82.1% for 6, 4, 3, and 2 classes, respectively. To sum up, the classification accuracy can be improved by deploying the PSO algorithm for feature selection.
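
For context, the core of an Extreme Learning Machine is small enough to sketch in NumPy: a random hidden layer followed by a least-squares solve for the output weights. The digits dataset and the 300 hidden nodes are placeholders, and the PSO wrapper for feature selection and hidden-node count used in the paper is omitted.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# A bare-bones Extreme Learning Machine: random hidden layer, analytic output weights
class ELM:
    def __init__(self, n_hidden=200, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y):
        classes = np.unique(y)
        T = (y[:, None] == classes).astype(float)        # one-hot targets
        self.classes_ = classes
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = np.tanh(X @ self.W + self.b)                  # random hidden features
        self.beta = np.linalg.pinv(H) @ T                 # least-squares output weights
        return self

    def predict(self, X):
        H = np.tanh(X @ self.W + self.b)
        return self.classes_[np.argmax(H @ self.beta, axis=1)]

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X / 16.0, y, random_state=0)
print("accuracy:", np.mean(ELM(n_hidden=300).fit(X_tr, y_tr).predict(X_te) == y_te))
```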

Journal ArticleDOI
TL;DR: In this paper, the authors used daily mobility data aggregated at the county level, together with COVID-19 statistics and demographic information, for short-term forecasting of COVID-19 outbreaks in the United States.
Abstract: The early detection of the coronavirus disease 2019 (COVID-19) outbreak is important to save people’s lives and restart the economy quickly and safely. People’s social behavior, reflected in their mobility data, plays a major role in spreading the disease. Therefore, we used the daily mobility data aggregated at the county level beside COVID-19 statistics and demographic information for short-term forecasting of COVID-19 outbreaks in the United States. The daily data are fed to a deep learning model based on Long Short-Term Memory (LSTM) to predict the accumulated number of COVID-19 cases in the next two weeks. A significant average correlation was achieved (r=0.83 (p = 0.005)) between the model predicted and actual accumulated cases in the interval from August 1, 2020 until January 22, 2021. The model predictions had r > 0.7 for 87% of the counties across the United States. A lower correlation was reported for the counties with total cases of <1000 during the test interval. The average mean absolute error (MAE) was 605.4 and decreased with a decrease in the total number of cases during the testing interval. The model was able to capture the effect of government responses on COVID-19 cases. Also, it was able to capture the effect of age demographics on the COVID-19 spread. It showed that the average daily cases decreased with a decrease in the retiree percentage and increased with an increase in the young percentage. Lessons learned from this study not only can help with managing the COVID-19 pandemic but also can help with early and effective management of possible future pandemics. The code used for this study was made publicly available on https://github.com/Murtadha44/covid-19-spread-risk.

Journal ArticleDOI
TL;DR: This paper combines the X-Means algorithm, the ensemble learning system, and the N-List structure to analyze the customer portfolio of a mobile telecommunication company and provide value-added services.
Abstract: Value-Added Services at a mobile telecommunication company provide customers with a variety of services. Value-added services generate significant revenue annually for telecommunication companies. Providing solutions that can supply customers of a telecommunication company with relevant and engaging services has become a major challenge in this field. Numerous methods have been proposed so far to analyze the customer basket and provide related services. Although these methods have many applications, they still face difficulties in improving the accuracy of offers. This paper combines the X-Means algorithm, an ensemble learning system, and the N-List structure to analyze the customer portfolio of a mobile telecommunication company and provide value-added services. The X-Means algorithm is used to determine the optimal number of clusters and the clustering of customers in a mobile telecommunication company. The ensemble learning algorithm is also used to assign categories to new customers, and finally the N-List structure is used for customer basket analysis. By simulating the proposed method and comparing it with other methods including KNN, SVM, and deep neural networks, the accuracy improved by about 7%.
