
Showing papers in "Journal of Big Data in 2021"


Journal ArticleDOI
TL;DR: In this paper, a comprehensive survey of the most important aspects of DL, including recent enhancements to the field, is provided, together with the challenges and suggested solutions to help researchers understand the existing research gaps.
Abstract: In the last few years, the deep learning (DL) computing paradigm has been deemed the Gold Standard in the machine learning (ML) community. Moreover, it has gradually become the most widely used computational approach in the field of ML, achieving outstanding results on several complex cognitive tasks, matching or even beating human performance. One of the benefits of DL is the ability to learn from massive amounts of data. The DL field has grown quickly in the last few years and has been used successfully to address a wide range of traditional applications. More importantly, DL has outperformed well-known ML techniques in many domains, e.g., cybersecurity, natural language processing, bioinformatics, robotics and control, and medical information processing, among many others. Although several works have reviewed the state of the art of DL, each of them tackled only one aspect of it, which leads to an overall lack of knowledge about the field. Therefore, in this contribution, we take a more holistic approach in order to provide a more suitable starting point from which to develop a full understanding of DL. Specifically, this review attempts to provide a comprehensive survey of the most important aspects of DL, including those enhancements recently added to the field. In particular, this paper outlines the importance of DL and presents the types of DL techniques and networks. It then presents convolutional neural networks (CNNs), the most utilized DL network type, and describes the development of CNN architectures together with their main features, starting with the AlexNet network and closing with the High-Resolution network (HR.Net). Finally, we present the challenges and suggested solutions to help researchers understand the existing research gaps, followed by a list of the major DL applications. Computational tools including FPGAs, GPUs, and CPUs are summarized along with a description of their influence on DL. The paper ends with the evolution matrix, benchmark datasets, and a summary and conclusion.

1,084 citations


Journal ArticleDOI
TL;DR: A survey of data augmentation for text data can be found in this article, where the major motifs of Data Augmentation are summarized into strengthening local decision boundaries, brute force training, causality and counterfactual examples, and the distinction between meaning and form.
Abstract: Natural Language Processing (NLP) is one of the most captivating applications of Deep Learning. In this survey, we consider how the Data Augmentation training strategy can aid in its development. We begin with the major motifs of Data Augmentation summarized into strengthening local decision boundaries, brute force training, causality and counterfactual examples, and the distinction between meaning and form. We follow these motifs with a concrete list of augmentation frameworks that have been developed for text data. Deep Learning generally struggles with the measurement of generalization and characterization of overfitting. We highlight studies that cover how augmentations can construct test sets for generalization. NLP is at an early stage in applying Data Augmentation compared to Computer Vision. We highlight the key differences and promising ideas that have yet to be tested in NLP. For the sake of practical implementation, we describe tools that facilitate Data Augmentation such as the use of consistency regularization, controllers, and offline and online augmentation pipelines, to preview a few. Finally, we discuss interesting topics around Data Augmentation in NLP such as task-specific augmentations, the use of prior knowledge in self-supervised learning versus Data Augmentation, intersections with transfer and multi-task learning, and ideas for AI-GAs (AI-Generating Algorithms). We hope this paper inspires further research interest in Text Data Augmentation.

487 citations
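
As a toy illustration of the kind of text-space augmentation the survey discusses (not code from the paper), the sketch below applies two simple operations, random swap and random deletion, to a tokenized sentence. The function names and probabilities are arbitrary choices made for this example.

```python
import random

def random_swap(tokens, n_swaps=1):
    """Swap two randomly chosen tokens n_swaps times."""
    tokens = tokens[:]
    for _ in range(n_swaps):
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p=0.1):
    """Drop each token independently with probability p."""
    kept = [t for t in tokens if random.random() > p]
    return kept or tokens  # never return an empty sentence

sentence = "data augmentation can strengthen local decision boundaries".split()
print(random_swap(sentence))
print(random_deletion(sentence))
```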


Journal ArticleDOI
TL;DR: In this paper, a survey explores how deep learning has been used in combating the COVID-19 pandemic and provides directions for future research, covering deep learning applications in natural language processing, computer vision, life sciences, and epidemiology.
Abstract: This survey explores how Deep Learning has battled the COVID-19 pandemic and provides directions for future research on COVID-19. We cover Deep Learning applications in Natural Language Processing, Computer Vision, Life Sciences, and Epidemiology. We describe how each of these applications vary with the availability of big data and how learning tasks are constructed. We begin by evaluating the current state of Deep Learning and conclude with key limitations of Deep Learning for COVID-19 applications. These limitations include Interpretability, Generalization Metrics, Learning from Limited Labeled Data, and Data Privacy. Natural Language Processing applications include mining COVID-19 research for Information Retrieval and Question Answering, as well as Misinformation Detection, and Public Sentiment Analysis. Computer Vision applications cover Medical Image Analysis, Ambient Intelligence, and Vision-based Robotics. Within Life Sciences, our survey looks at how Deep Learning can be applied to Precision Diagnostics, Protein Structure Prediction, and Drug Repurposing. Deep Learning has additionally been utilized in Spread Forecasting for Epidemiology. Our literature review has found many examples of Deep Learning systems to fight COVID-19. We hope that this survey will help accelerate the use of Deep Learning for COVID-19 research.

139 citations


Posted ContentDOI
TL;DR: This paper aggregates some of the literature on missing data particularly focusing on machine learning techniques, and gives insight on how the machine learning approaches work by highlighting the key features of the proposed techniques, how they perform, their limitations and the kind of data they are most suitable for.
Abstract: Machine learning has been the cornerstone in analysing and extracting information from data, and a problem of missing values is often encountered. Missing values occur because of various factors: missing completely at random, missing at random, or missing not at random. All of these may result from system malfunction during data collection or human error during data pre-processing. Nevertheless, it is important to deal with missing values before analysing data, since ignoring or omitting missing values may result in biased or misinformed analysis. In the literature there have been several proposals for handling missing values. In this paper, we aggregate some of the literature on missing data, particularly focusing on machine learning techniques. We also give insight into how the machine learning approaches work by highlighting the key features of missing value imputation techniques, how they perform, their limitations, and the kind of data they are most suitable for. We propose and evaluate two methods, the k nearest neighbour and an iterative imputation method (missForest) based on the random forest algorithm. Evaluation is performed on the Iris and novel power plant fan data with induced missing values at missingness rates of 5% to 20%. We show that both missForest and the k nearest neighbour can successfully handle missing values and offer some possible future research directions.

138 citations
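
To make the two evaluated imputers concrete, here is a minimal sketch using scikit-learn: KNNImputer stands in for the k nearest neighbour method, and IterativeImputer with a random-forest estimator approximates the missForest idea. The Iris data and the ~10% missingness mask are placeholders, not the paper's exact setup.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import KNNImputer, IterativeImputer
from sklearn.ensemble import RandomForestRegressor

X = load_iris().data.copy()
rng = np.random.default_rng(0)
mask = rng.random(X.shape) < 0.10          # induce ~10% missingness
X_missing = X.copy()
X_missing[mask] = np.nan

# k-nearest-neighbour imputation
X_knn = KNNImputer(n_neighbors=5).fit_transform(X_missing)

# iterative imputation with a random-forest estimator (missForest-style)
forest = RandomForestRegressor(n_estimators=50, random_state=0)
X_rf = IterativeImputer(estimator=forest, max_iter=10,
                        random_state=0).fit_transform(X_missing)

print("kNN RMSE:", np.sqrt(np.mean((X_knn[mask] - X[mask]) ** 2)))
print("missForest-style RMSE:", np.sqrt(np.mean((X_rf[mask] - X[mask]) ** 2)))
```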


Journal ArticleDOI
TL;DR: This project presents a comparative analysis of 3 major image processing algorithms, SSD, Faster R-CNN, and YOLO, evaluating the performance and accuracy of the three algorithms and analysing their strengths and weaknesses.
Abstract: A computer views all kinds of visual media as an array of numerical values. As a consequence of this approach, it requires image processing algorithms to inspect the contents of images. This project compares 3 major image processing algorithms, Single Shot Detection (SSD), Faster Region based Convolutional Neural Networks (Faster R-CNN), and You Only Look Once (YOLO), to find the fastest and most efficient of the three. In this comparative analysis, using the Microsoft COCO (Common Objects in Context) dataset, the performance of these three algorithms is evaluated and their strengths and limitations are analysed based on parameters such as accuracy, precision and F1 score. From the results of the analysis, it can be concluded that the suitability of any of the algorithms over the other two is dictated to a great extent by the use cases they are applied in. In an identical testing environment, YOLO-v3 outperforms SSD and Faster R-CNN, making it the best of the three algorithms.

99 citations
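
As a small illustration of how the reported parameters (precision, recall, F1) can be computed for detections, the sketch below greedily matches predicted and ground-truth boxes by IoU. It is a simplification for illustration only, not the COCO evaluation protocol used in the study.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def detection_scores(predictions, ground_truth, thr=0.5):
    """Greedy IoU matching -> precision, recall, F1 for one image."""
    matched, tp = set(), 0
    for p in predictions:
        best, best_iou = None, 0.0
        for i, g in enumerate(ground_truth):
            if i not in matched and iou(p, g) > best_iou:
                best, best_iou = i, iou(p, g)
        if best is not None and best_iou >= thr:
            matched.add(best)
            tp += 1
    fp = len(predictions) - tp
    fn = len(ground_truth) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

preds = [[10, 10, 50, 50], [60, 60, 100, 100]]
truth = [[12, 12, 48, 52]]
print(detection_scores(preds, truth))
```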


Journal ArticleDOI
TL;DR: In this paper, the authors present a comprehensive review of existing data-efficient methods and systematize them into four categories: using non-supervised algorithms that are by nature more data-efficient, creating artificially more data, transferring knowledge from rich-data domains into poor-data domains, and altering data-hungry algorithms to reduce their dependency upon the amount of samples.
Abstract: The leading approaches in Machine Learning are notoriously data-hungry. Unfortunately, many application domains do not have access to big data because acquiring data involves a process that is expensive or time-consuming. This has triggered a serious debate in both the industrial and academic communities calling for more data-efficient models that harness the power of artificial learners while achieving good results with less training data and in particular less human supervision. In light of this debate, this work investigates the issue of algorithms' data hungriness. First, it surveys the issue from different perspectives. Then, it presents a comprehensive review of existing data-efficient methods and systematizes them into four categories. Specifically, the survey covers solution strategies that handle data-efficiency by (i) using non-supervised algorithms that are, by nature, more data-efficient, by (ii) creating artificially more data, by (iii) transferring knowledge from rich-data domains into poor-data domains, or by (iv) altering data-hungry algorithms to reduce their dependency upon the amount of samples, in a way they can perform well in the small samples regime. Each strategy is extensively reviewed and discussed. In addition, the emphasis is put on how the four strategies interplay with each other in order to motivate exploration of more robust and data-efficient algorithms. Finally, the survey delineates the limitations, discusses research challenges, and suggests future opportunities to advance the research on data-efficiency in machine learning.

65 citations


Journal ArticleDOI
TL;DR: In this article, a deep convolution neural network (CNN) was modified and adapted for person recognition with Image Augmentation (IA) technique depending on gait features, which improved the accuracy of person recognition using gait model comparing to model without adaptation.
Abstract: Person Recognition based on Gait Model (PRGM) and motion features is indeed a challenging and novel task due to its usages and to the critical issues of human pose variation, human body occlusion, camera view variation, etc. In this project, a deep convolution neural network (CNN) was modified and adapted for person recognition with an Image Augmentation (IA) technique depending on gait features. Adaptation aims to get the best values for the CNN parameters and thus the best CNN model. In addition to the adaptation of the CNN parameters, the design of the CNN model itself was adapted to get the best model structure; adaptation of the design affected the type and number of layers in the CNN and the normalization between them. After choosing the best parameters and the best design, image augmentation was used to increase the size of the training dataset with many copies of each image, to boost the number of different images that will be used to train the deep learning algorithm. The tests were performed using a known dataset (the Market dataset), which contains sequential pictures of people in different gait states. In the CNN model, each image, as a matrix, is expanded into many matrices by the convolution, so the dataset size may grow by a hundred times, making the problem a big data issue. In this project, results show that adaptation improved the accuracy of person recognition using the gait model compared to the model without adaptation. In addition, the dataset contains images of persons carrying things. The IA technique made the model robust to some variations such as image dimensions (quality and resolution), rotations, and objects carried by persons. For recognition of 200 persons, validation accuracy was about 82% without IA and 96.23% with IA. For recognition of 800 persons, validation accuracy was 93.62% without IA.

64 citations
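
A rough sketch of the image-augmentation idea applied to a person-identity CNN, using Keras' ImageDataGenerator. The directory name, image size, and network layout are assumptions made for illustration, not the adapted architecture described in the paper.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras import layers, models

# Augmentation pipeline: rescaling, rotations, shifts, zoom, and flips
augmenter = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.2,
    horizontal_flip=True,
)

train_flow = augmenter.flow_from_directory(
    "market_train/",            # hypothetical directory of per-person image folders
    target_size=(128, 64),
    batch_size=32,
    class_mode="categorical",
)

# A small CNN classifier over person identities (illustrative layout only)
model = models.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(128, 64, 3)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(train_flow.num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train_flow, epochs=10)
```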


Journal ArticleDOI
TL;DR: In this paper, resampling is used to adjust the ratio between the different classes, making the data more balanced and improving the performance of artificial neural network multi-class classifiers.
Abstract: Machine learning plays an increasingly significant role in the building of Network Intrusion Detection Systems. However, machine learning models trained with imbalanced cybersecurity data cannot recognize minority data, hence attacks, effectively. One way to address this issue is to use resampling, which adjusts the ratio between the different classes, making the data more balanced. This research looks at resampling's influence on the performance of Artificial Neural Network multi-class classifiers. The resampling methods, random undersampling, random oversampling, random undersampling and random oversampling, random undersampling with Synthetic Minority Oversampling Technique, and random undersampling with Adaptive Synthetic Sampling Method, were used on benchmark cybersecurity datasets, KDD99, UNSW-NB15, UNSW-NB17 and UNSW-NB18. Macro precision, macro recall, and macro F1-score were used to evaluate the results. The patterns found were: first, oversampling increases the training time and undersampling decreases the training time; second, if the data is extremely imbalanced, both oversampling and undersampling increase recall significantly; third, if the data is not extremely imbalanced, resampling will not have much of an impact; fourth, with resampling, mostly oversampling, more of the minority data (attacks) were detected.

62 citations
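
The resampling methods named above map directly onto imbalanced-learn classes. The sketch below applies them to a synthetic imbalanced dataset to show the class counts before and after resampling; the dataset and class weights are placeholders, not the benchmark cybersecurity data.

```python
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN
from sklearn.datasets import make_classification

# A synthetic, heavily imbalanced stand-in for an intrusion-detection dataset
X, y = make_classification(n_samples=20000, n_classes=3, n_informative=6,
                           weights=[0.95, 0.04, 0.01], random_state=0)
print("original:", Counter(y))

for name, sampler in [
    ("random undersampling", RandomUnderSampler(random_state=0)),
    ("random oversampling", RandomOverSampler(random_state=0)),
    ("SMOTE", SMOTE(random_state=0)),
    ("ADASYN", ADASYN(random_state=0)),
]:
    X_res, y_res = sampler.fit_resample(X, y)
    print(name, Counter(y_res))
```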


Journal ArticleDOI
TL;DR: In this article, the authors examine the most recent developments of GANs based techniques for addressing imbalance problems in image data and propose a taxonomy to summarize GAN-based techniques.
Abstract: Any computer vision application development starts off by acquiring images and data, then preprocessing and pattern recognition steps to perform a task. When the acquired images are highly imbalanced and not adequate, the desired task may not be achievable. Unfortunately, the occurrence of imbalance problems in acquired image datasets in certain complex real-world problems such as anomaly detection, emotion recognition, medical image analysis, fraud detection, metallic surface defect detection, disaster prediction, etc., is inevitable. The performance of computer vision algorithms can significantly deteriorate when the training dataset is imbalanced. In recent years, Generative Adversarial Neural Networks (GANs) have gained immense attention from researchers across a variety of application domains due to their capability to model complex real-world image data. Importantly, GANs can not only be used to generate synthetic images; their adversarial learning idea has also shown good potential for restoring balance in imbalanced datasets. In this paper, we examine the most recent developments of GAN-based techniques for addressing imbalance problems in image data. The real-world challenges and implementations of synthetic image generation based on GANs are extensively covered in this survey. Our survey first introduces various imbalance problems in computer vision tasks and their existing solutions, and then examines key concepts such as deep generative image models and GANs. After that, we propose a taxonomy to summarize GAN-based techniques for addressing imbalance problems in computer vision tasks into three major categories: (1) image-level imbalances in classification, (2) object-level imbalances in object detection, and (3) pixel-level imbalances in segmentation tasks. We elaborate the imbalance problems of each group and provide GAN-based solutions for each group. Readers will understand how GAN-based techniques can handle the problem of imbalances and boost the performance of computer vision algorithms.

61 citations


Journal ArticleDOI
TL;DR: In this paper, the authors propose a genetic algorithm based on community detection, which functions in three steps: feature similarities are calculated in the first step; features are classified by community detection algorithms into clusters in the second step; and in the third step, features are picked by a genetic algorithm with a new community-based repair operation.
Abstract: Feature selection is an essential data preprocessing stage in data mining. The core principle of feature selection is to pick a subset of possible features by excluding features with almost no predictive information as well as highly associated redundant features. In the past several years, a variety of meta-heuristic methods were introduced to eliminate redundant and irrelevant features as much as possible from high-dimensional datasets. Among the main disadvantages of present meta-heuristic based approaches is that they often neglect the correlation between a set of selected features. In this article, for the purpose of feature selection, the authors propose a genetic algorithm based on community detection, which functions in three steps. The feature similarities are calculated in the first step. The features are classified by community detection algorithms into clusters throughout the second step. In the third step, features are picked by a genetic algorithm with a new community-based repair operation. Nine benchmark classification problems were analyzed in terms of the performance of the presented approach. Also, the authors have compared the efficiency of the proposed approach with the findings from four available algorithms for feature selection. Comparing the performance of the proposed method with three new feature selection methods based on PSO, ACO, and ABC algorithms on three classifiers showed that the accuracy of the proposed method is on average 0.52% higher than the PSO, 1.20% higher than the ACO, and 1.57% higher than the ABC algorithm.

58 citations


Journal ArticleDOI
TL;DR: In this paper, a data science model for stock price forecasting on the Indonesian exchange was proposed, based on statistical computing with the R language and Long Short-Term Memory (LSTM).
Abstract: The stock market process is full of uncertainty; hence stock price forecasting is very important in finance and business. For stockbrokers, understanding trends, supported by prediction software, is very important for decision making. This paper proposes a data science model for stock price forecasting on the Indonesian exchange based on statistical computing with the R language and Long Short-Term Memory (LSTM). The first COVID-19 (Coronavirus disease 2019) confirmed case in Indonesia was on 2 March 2020. After that, the composite stock price index plunged 28% from the start of the year, and the share prices of cigarette producers and banks in the midst of the corona pandemic reached their lowest values on March 24, 2020. We use big data from Bank Central Asia (BCA) and Bank Mandiri of Indonesia obtained from Yahoo Finance. In our experiments, we visualize the data using data science and predict and simulate the important prices called Open, High, Low and Close (OHLC) with various parameters. Based on the experiments, data science is very useful for data visualization, and our proposed method using Long Short-Term Memory (LSTM) can be used as a predictor on short-term data, with an accuracy of 94.57% obtained from the short-term (1 year) training data with a high number of epochs, rather than from 3 years of training data.
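
A minimal sketch of an LSTM forecaster over OHLC windows in Keras. The CSV file name, the 30-day window, and the network size are assumptions for illustration; the paper's R-based data science pipeline is not reproduced here.

```python
import numpy as np
import pandas as pd
from tensorflow.keras import layers, models

# Hypothetical CSV of daily prices exported from Yahoo Finance (column names assumed)
prices = pd.read_csv("BBCA.JK.csv")[["Open", "High", "Low", "Close"]].values
prices = (prices - prices.min(0)) / (prices.max(0) - prices.min(0))  # min-max scaling

# Sliding windows: 30 past days of OHLC -> next-day OHLC
window = 30
X = np.array([prices[i:i + window] for i in range(len(prices) - window)])
y = prices[window:]

model = models.Sequential([
    layers.LSTM(64, input_shape=(window, 4)),
    layers.Dense(4),                          # predict Open, High, Low, Close
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=50, batch_size=32, validation_split=0.1)
```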

Journal ArticleDOI
TL;DR: In this article, the authors provide a systematic review of research works that are relevant to AI assurance, between years 1985 and 2021, and aim to provide a structured alternative to the landscape.
Abstract: Artificial Intelligence (AI) algorithms are increasingly providing decision making and operational support across multiple domains. AI includes a wide (and growing) library of algorithms that could be applied for different problems. One important notion for the adoption of AI algorithms into operational decision processes is the concept of assurance. The literature on assurance, unfortunately, conceals its outcomes within a tangled landscape of conflicting approaches, driven by contradicting motivations, assumptions, and intuitions. Accordingly, albeit a rising and novel area, this manuscript provides a systematic review of research works that are relevant to AI assurance, between years 1985 and 2021, and aims to provide a structured alternative to the landscape. A new AI assurance definition is adopted and presented, and assurance methods are contrasted and tabulated. Additionally, a ten-metric scoring system is developed and introduced to evaluate and compare existing methods. Lastly, in this manuscript, we provide foundational insights, discussions, future directions, a roadmap, and applicable recommendations for the development and deployment of AI assurance.

Journal ArticleDOI
TL;DR: One-class classification (OCC) as mentioned in this paper is an approach to detect abnormal data points compared to the instances of the known class and can serve to address issues related to severely imbalanced datasets, which are especially very common in big data.
Abstract: In severely imbalanced datasets, using traditional binary or multi-class classification typically leads to bias towards the class(es) with the much larger number of instances. Under such conditions, modeling and detecting instances of the minority class is very difficult. One-class classification (OCC) is an approach to detect abnormal data points compared to the instances of the known class and can serve to address issues related to severely imbalanced datasets, which are especially very common in big data. We present a detailed survey of OCC-related literature works published over the last decade, approximately. We group the different works into three categories: outlier detection, novelty detection, and deep learning and OCC. We closely examine and evaluate selected works on OCC such that a good cross section of approaches, methods, and application domains is represented in the survey. Commonly used techniques in OCC for outlier detection and for novelty detection, respectively, are discussed. We observed one area that has been largely omitted in OCC-related literature is its application context for big data and its inherently associated problems, such as severe class imbalance, class rarity, noisy data, feature selection, and data reduction. We feel the survey will be appreciated by researchers working in these areas of big data.
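
For readers new to OCC, the sketch below shows the basic pattern with two common scikit-learn one-class learners: train only on the known class, then score unseen points as inliers or outliers. The data and hyperparameters are illustrative only, not drawn from the survey.

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(1000, 5))       # the known (majority) class only
anomalies = rng.normal(5, 1, size=(10, 5))      # rare instances never seen in training

# Fit on the known class alone, then score unseen data
ocsvm = OneClassSVM(nu=0.05, kernel="rbf").fit(normal)
iforest = IsolationForest(contamination=0.01, random_state=0).fit(normal)

test = np.vstack([normal[:50], anomalies])
print("One-Class SVM:", ocsvm.predict(test))     # +1 = inlier, -1 = outlier
print("Isolation Forest:", iforest.predict(test))
```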

Journal ArticleDOI
TL;DR: In this paper, the authors propose an ensemble approach based on the bagging technique to improve feature selection stability in medical datasets via data variance reduction, evaluated on four microarray datasets with five well-known feature selection algorithms.
Abstract: In the medical field, distinguishing genes that are relevant to a specific disease, let's say colon cancer, is crucial to finding a cure and understanding its causes and subsequent complications. Usually, medical datasets are comprised of immensely complex dimensions with considerably small sample sizes. Thus, for domain experts, such as biologists, the task of identifying these genes has become a very challenging one, to say the least. Feature selection is a technique that aims to select these genes, or features in the machine learning field, with respect to the disease. However, learning from a medical dataset to identify relevant features suffers from the curse of dimensionality. Due to a large number of features with a small sample size, the selection usually returns a different subset each time a new sample is introduced into the dataset. This selection instability is intrinsically related to data variance. We assume that reducing data variance improves selection stability. In this paper, we propose an ensemble approach based on the bagging technique to improve feature selection stability in medical datasets via data variance reduction. We conducted an experiment using four microarray datasets, each of which suffers from high dimensionality and relatively small sample size. On each dataset, we applied five well-known feature selection algorithms to select varying numbers of features. The proposed technique shows a significant improvement in selection stability while at least maintaining the classification accuracy. The stability improvement ranges from 20 to 50 percent in all cases, implying that the likelihood of selecting the same features increased by 20 to 50 percent. This is accompanied by an increase in classification accuracy in most cases, which reinforces the stated stability results.
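
One plausible way to realize the described bagging idea is sketched below: run a base feature selector on bootstrap samples, aggregate feature votes, and measure the overlap against a single run as a simple stability proxy. The selector, the number of bootstraps, and the Jaccard measure are assumptions, not the paper's exact protocol.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.utils import resample

# High-dimensional, small-sample stand-in for a microarray dataset
X, y = make_classification(n_samples=60, n_features=2000, n_informative=20,
                           random_state=0)

def select_top_k(X, y, k=50):
    selector = SelectKBest(f_classif, k=k).fit(X, y)
    return set(np.flatnonzero(selector.get_support()))

# Bagging: run the selector on B bootstrap samples and keep the most frequent features
B, k = 30, 50
votes = np.zeros(X.shape[1])
for b in range(B):
    Xb, yb = resample(X, y, random_state=b)
    for f in select_top_k(Xb, yb, k):
        votes[f] += 1
aggregated = set(np.argsort(votes)[-k:])

# Stability proxy: Jaccard overlap with a selection from one perturbed dataset
single_run = select_top_k(*resample(X, y, random_state=99), k)
print("overlap with a single bootstrap run:",
      len(aggregated & single_run) / len(aggregated | single_run))
```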

Journal ArticleDOI
TL;DR: In this paper, the authors proposed a multi-source information-fusion stock price prediction framework based on a hybrid deep neural network architecture (Convolution Neural Networks (CNN) and Long Short-Term Memory (LSTM) for market analysis.
Abstract: The stock market is very unstable and volatile due to several factors such as public sentiment, economic factors, and more. Several petabytes of data are generated every second from different sources, which affect the stock market. A fair and efficient fusion of these data sources (factors) into intelligence is expected to offer better prediction accuracy on the stock market. However, integrating these factors from different data sources as one dataset for market analysis is seen as challenging because they come in different formats (numerical or text). In this study, we propose a novel multi-source information-fusion stock price prediction framework based on a hybrid deep neural network architecture (Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM)) named IKN-ConvLSTM. Precisely, we design a predictive framework to integrate stock-related information from six (6) heterogeneous sources. Secondly, we construct a base model using CNN and a random search algorithm as a feature selector to optimise our initial training parameters. Finally, a stacked LSTM network is fine-tuned by using the tuned parameters (features) from the base model to enhance prediction accuracy. Our approach's empirical evaluation was carried out with stock data (January 3, 2017, to January 31, 2020) from the Ghana Stock Exchange (GSE). The results show a good prediction accuracy of 98.31%, specificity (0.9975), sensitivity (0.8939) and F-score (0.9672) on the amalgamated dataset compared with the distinct datasets. Based on the study outcome, it can be concluded that efficient information fusion of different stock price indicators as a single data source for market prediction offers higher prediction accuracy than individual data sources.

Journal ArticleDOI
TL;DR: In this article, the authors implemented deep learning solutions for detecting attacks based on Long Short-Term Memory (LSTM) and used PCA (principal component analysis) and Mutual Information (MI) techniques for dimensionality reduction and feature selection techniques.
Abstract: An intrusion detection system (IDS) is a device or software application that monitors a network for malicious activity or policy violations. It scans a network or a system for a harmful activity or security breaching. IDS protects networks (Network-based intrusion detection system NIDS) or hosts (Host-based intrusion detection system HIDS), and work by either looking for signatures of known attacks or deviations from normal activity. Deep learning algorithms proved their effectiveness in intrusion detection compared to other machine learning methods. In this paper, we implemented deep learning solutions for detecting attacks based on Long Short-Term Memory (LSTM). PCA (principal component analysis) and Mutual information (MI) are used as dimensionality reduction and feature selection techniques. Our approach was tested on a benchmark data set, KDD99, and the experimental outcomes show that models based on PCA achieve the best accuracy for training and testing, in both binary and multiclass classification.
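
A small sketch of the two preprocessing steps mentioned, PCA and mutual-information-based selection, applied to a KDD99 sample with scikit-learn (the dataset is downloaded on first use). The subsample size, the choice of 10 components/features, and the removal of the symbolic columns are simplifications, and the LSTM classifier itself is omitted.

```python
from sklearn.datasets import fetch_kddcup99
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# 10% KDD99 subset; only a slice and only the numeric columns are kept for brevity
X, y = fetch_kddcup99(percent10=True, return_X_y=True)
X, y = X[:20000], LabelEncoder().fit_transform(y[:20000])
numeric = X[:, [0] + list(range(4, X.shape[1]))].astype(float)  # drop 3 symbolic columns
X_scaled = StandardScaler().fit_transform(numeric)

# Dimensionality reduction with PCA
X_pca = PCA(n_components=10).fit_transform(X_scaled)

# Feature selection with mutual information
X_mi = SelectKBest(mutual_info_classif, k=10).fit_transform(X_scaled, y)

print(X_pca.shape, X_mi.shape)   # either matrix could then feed an LSTM-based classifier
```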

Journal ArticleDOI
TL;DR: In this article, the authors used a blockchain network with three channels and used the raft consensus algorithm in designing web interfaces and testing their capabilities to thwart the formation of a block in case of data input errors from the user The server can also do the process as a provider of information and validator for the web interface.
Abstract: Halal Supply Chain Management requires an assurance that the entire process of procurement, distribution, handling, and processing of materials, spare parts, livestock, work-in-process, or finished inventory is well documented and performed fit to the Halal and Toyyib. Blockchain technology is one alternative solution that can improve the Halal Supply Chain, as it can integrate technology for information exchange during the tracking and tracing process in operating and monitoring performance. This technology could improve trust, transparency, and information disclosure between supply chain participants since it could act as a distributed ledger and entitle all transactions to be completely open, yet confidential, immutable, and secured. This study uses a Blockchain Network with three channels and uses the raft consensus algorithm in designing web interfaces and testing their capabilities. From the web interface, there were no failures in the validity test during the invoke test and the query test. In addition, the web interface was also successfully tested to thwart the formation of a block in case of data input errors from the user. The server can also act as a provider of information and validator for the web interface. From the results of simulations conducted on the Blockchain Network that was made, the Blockchain's transaction speed is fast and all transactions are successfully transferred to other peers. Thus, a Permissioned Blockchain is useful for the Halal Supply Chain not just because it can secure transactions from some of the halal issues, but also because the transaction speed and rate of data transfer are very effective.

Journal ArticleDOI
TL;DR: The working of classification-based methods mostly relies on a confidence score, which is calculated by the classifier while making a prediction for the test observation, while some clustering-based methods identify the outliers by not forcing every observation to belong to a label.
Abstract: Detection and removal of outliers in a dataset is a fundamental preprocessing task without which the analysis of the data can be misleading. Furthermore, the existence of anomalies in the data can heavily degrade the performance of machine learning algorithms. In order to detect the anomalies in a dataset in an unsupervised manner, some novel statistical techniques are proposed in this paper. The proposed techniques are based on statistical methods considering data compactness and other properties. The newly proposed ideas are found efficient in terms of performance, ease of implementation, and computational complexity. Furthermore, two proposed techniques presented in this paper use transformation of data to a unidimensional distance space to detect the outliers, so irrespective of the data’s high dimensions, the techniques remain computationally inexpensive and feasible. Comprehensive performance analysis of the proposed anomaly detection schemes is presented in the paper, and the newly proposed schemes are found better than the state-of-the-art methods when tested on several benchmark datasets.
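
The transformation to a unidimensional distance space can be illustrated with a very simple statistic: distance to the centroid, thresholded by a robust cutoff. This is only one possible instantiation of the idea, not the paper's proposed techniques; the median-plus-MAD threshold is an arbitrary choice for the sketch.

```python
import numpy as np

def distance_outliers(X, k=3.0):
    """Map points to a one-dimensional distance space (distance to the centroid)
    and flag points whose distance exceeds median + k * MAD."""
    centroid = X.mean(axis=0)
    d = np.linalg.norm(X - centroid, axis=1)          # unidimensional distance space
    mad = np.median(np.abs(d - np.median(d)))
    threshold = np.median(d) + k * mad
    return d > threshold

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(500, 20)),      # compact inliers
               rng.normal(8, 1, size=(5, 20))])       # a few anomalies
print(np.flatnonzero(distance_outliers(X)))
```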

Journal ArticleDOI
TL;DR: In this article, the authors propose a new QA system for translating natural language questions into SPARQL queries, which breaks up the translation process into five smaller, more manageable sub-tasks and uses ensemble machine learning methods as well as Tree-LSTM-based neural network models.
Abstract: Knowledge graphs are a powerful concept for querying large amounts of data. These knowledge graphs are typically enormous and are often not easily accessible to end-users because they require specialized knowledge in query languages such as SPARQL. Moreover, end-users need a deep understanding of the structure of the underlying data models, often based on the Resource Description Framework (RDF). This drawback has led to the development of Question-Answering (QA) systems that enable end-users to express their information needs in natural language. While existing systems simplify user access, there is still room for improvement in the accuracy of these systems. In this paper, we propose a new QA system for translating natural language questions into SPARQL queries. The key idea is to break up the translation process into five smaller, more manageable sub-tasks and use ensemble machine learning methods as well as Tree-LSTM-based neural network models to automatically learn and translate a natural language question into a SPARQL query. The performance of our proposed QA system is empirically evaluated using two renowned benchmarks: the 7th Question Answering over Linked Data Challenge (QALD-7) and the Large-Scale Complex Question Answering Dataset (LC-QuAD). Experimental results show that our QA system outperforms the state-of-the-art systems by 15% on the QALD-7 dataset and by 48% on the LC-QuAD dataset, respectively. In addition, we make our source code available.
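
To show what the translation target looks like, here is a hypothetical natural-language question together with a hand-written SPARQL query for it, executed against the public DBpedia endpoint via SPARQLWrapper. The question, query, and endpoint are illustrative and unrelated to the benchmark datasets used in the paper.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Natural-language question: "Who wrote The Hobbit?"
# A QA system like the one described would translate it into a SPARQL query such as:
query = """
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?author WHERE {
  dbr:The_Hobbit dbo:author ?author .
}
"""

endpoint = SPARQLWrapper("https://dbpedia.org/sparql")
endpoint.setQuery(query)
endpoint.setReturnFormat(JSON)
results = endpoint.query().convert()

for row in results["results"]["bindings"]:
    print(row["author"]["value"])
```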

Journal ArticleDOI
TL;DR: In this paper, a deep learning model was used to detect defective water meter devices with a prediction accuracy in the range of 87-90% even in the presence of categorical descriptors.
Abstract: Deep learning models are tools for data analysis suitable for approximating (non-linear) relationships among variables for the best prediction of an outcome. While these models can be used to answer many important questions, their utility is still harshly criticized, being extremely challenging to identify which data descriptors are the most adequate to represent a given specific phenomenon of interest. With a recent experience in the development of a deep learning model designed to detect failures in mechanical water meter devices, we have learnt that a sensible deterioration of the prediction accuracy can occur if one tries to train a deep learning model by adding specific device descriptors, based on categorical data. This can happen because of an excessive increase in the dimensions of the data, with a correspondent loss of statistical significance. After several unsuccessful experiments conducted with alternative methodologies that either permit to reduce the data space dimensionality or employ more traditional machine learning algorithms, we changed the training strategy, reconsidering that categorical data, in the light of a Pareto analysis. In essence, we used those categorical descriptors, not as an input on which to train our deep learning model, but as a tool to give a new shape to the dataset, based on the Pareto rule. With this data adjustment, we trained a more performative deep learning model able to detect defective water meter devices with a prediction accuracy in the range 87–90%, even in the presence of categorical descriptors.

Journal ArticleDOI
TL;DR: In this article, the authors explored classification performance in detecting web attacks in the recent CSE-CIC-IDS2018 dataset, considering a total of eight random undersampling (RUS) ratios: no sampling, 999:1, 99:1, 95:5, 9:1, 3:1, 65:35, and 1:1.
Abstract: Class imbalance is an important consideration for cybersecurity and machine learning. We explore classification performance in detecting web attacks in the recent CSE-CIC-IDS2018 dataset. This study considers a total of eight random undersampling (RUS) ratios: no sampling, 999:1, 99:1, 95:5, 9:1, 3:1, 65:35, and 1:1. Additionally, seven different classifiers are employed: Decision Tree (DT), Random Forest (RF), CatBoost (CB), LightGBM (LGB), XGBoost (XGB), Naive Bayes (NB), and Logistic Regression (LR). For classification performance metrics, Area Under the Receiver Operating Characteristic Curve (AUC) and Area Under the Precision-Recall Curve (AUPRC) are both utilized to answer the following three research questions. The first question asks: “Are various random undersampling ratios statistically different from each other in detecting web attacks?” The second question asks: “Are different classifiers statistically different from each other in detecting web attacks?” And, our third question asks: “Is the interaction between different classifiers and random undersampling ratios significant for detecting web attacks?” Based on our experiments, the answers to all three research questions is “Yes”. To the best of our knowledge, we are the first to apply random undersampling techniques to web attacks from the CSE-CIC-IDS2018 dataset while exploring various sampling ratios.
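
A compact sketch of sweeping undersampling ratios and scoring with AUC and AUPRC on a synthetic, highly imbalanced dataset. The ratios mirror the ones listed above, but the data, the single classifier, and any resulting numbers are placeholders rather than the CSE-CIC-IDS2018 experiments.

```python
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an extremely imbalanced web-attack dataset
X, y = make_classification(n_samples=100000, weights=[0.9995, 0.0005],
                           flip_y=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# sampling_strategy is the minority:majority ratio after undersampling
for ratio in [None, 1/999, 1/99, 5/95, 1/9, 1/3, 35/65, 1.0]:
    if ratio is None:
        X_res, y_res = X_tr, y_tr                      # "no sampling"
    else:
        rus = RandomUnderSampler(sampling_strategy=ratio, random_state=0)
        X_res, y_res = rus.fit_resample(X_tr, y_tr)
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_res, y_res)
    scores = clf.predict_proba(X_te)[:, 1]
    print(f"ratio={ratio}: AUC={roc_auc_score(y_te, scores):.3f}, "
          f"AUPRC={average_precision_score(y_te, scores):.3f}")
```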

Journal ArticleDOI
TL;DR: The proposed method in this paper consists of initial clustering of all users and assignment of new users to appropriate clusters, assigning appropriate weights to users' characteristics, and identifying a new user's adjacent users using hybrid similarity criteria and an adjacency matrix.
Abstract: Over the past decade, recommendation systems have been one of the most sought after by various researchers. Basket analysis of online systems’ customers and recommending attractive products (movies) to them is very important. Providing an attractive and favorite movie to the customer will increase the sales rate and ultimately improve the system. Various methods have been proposed so far to analyze customer baskets and offer entertaining movies but each of the proposed methods has challenges, such as lack of accuracy and high error of recommendations. In this paper, a link prediction-based method is used to meet the challenges of other methods. The proposed method in this paper consists of four phases: (1) Running the CBRS that in this phase, all users are clustered using Density-based spatial clustering of applications with noise algorithm (DBScan), and classification of new users using Deep Neural Network (DNN) algorithm. (2) Collaborative Recommender System (CRS) Based on Hybrid Similarity Criterion through which similarities are calculated based on a threshold (lambda) between the new user and the users in the selected category. Similarity criteria are determined based on age, gender, and occupation. The collaborative recommender system extracts users who are the most similar to the new user. Then, the higher-rated movie services are suggested to the new user based on the adjacency matrix. (3) Running improved Friendlink algorithm on the dataset to calculate the similarity between users who are connected through the link. (4) This phase is related to the combination of collaborative recommender system’s output and improved Friendlink algorithm. The results show that the Mean Squared Error (MSE) of the proposed model has decreased respectively 8.59%, 8.67%, 8.45% and 8.15% compared to the basic models such as Naive Bayes, multi-attribute decision tree and randomized algorithm. In addition, Mean Absolute Error (MAE) of the proposed method decreased by 4.5% compared to SVD and approximately 4.4% compared to ApproSVD and Root Mean Squared Error (RMSE) of the proposed method decreased by 6.05 % compared to SVD and approximately 6.02 % compared to ApproSVD.

Journal ArticleDOI
TL;DR: In this article, the authors used the CSE-CIC-IDS2018 dataset to investigate ensemble feature selection on the performance of seven classifiers, including Decision Tree (DT), Random Forest (RF), Naive Bayes (NB), Logistic Regression (LR), Catboost, LightGBM, or XGBoost.
Abstract: Machine learning algorithms efficiently trained on intrusion detection datasets can detect network traffic capable of jeopardizing an information system. In this study, we use the CSE-CIC-IDS2018 dataset to investigate ensemble feature selection on the performance of seven classifiers. CSE-CIC-IDS2018 is big data (about 16,000,000 instances), publicly available, modern, and covers a wide range of realistic attack types. Our contribution is centered around answers to three research questions. The first question is, “Does feature selection impact performance of classifiers in terms of Area Under the Receiver Operating Characteristic Curve (AUC) and F1-score?” The second question is, “Does including the Destination_Port categorical feature significantly impact performance of LightGBM and Catboost in terms of AUC and F1-score?” The third question is, “Does the choice of classifier: Decision Tree (DT), Random Forest (RF), Naive Bayes (NB), Logistic Regression (LR), Catboost, LightGBM, or XGBoost, significantly impact performance in terms of AUC and F1-score?” These research questions are all answered in the affirmative and provide valuable, practical information for the development of an efficient intrusion detection model. To the best of our knowledge, we are the first to use an ensemble feature selection technique with the CSE-CIC-IDS2018 dataset.

Journal ArticleDOI
TL;DR: In this article, the authors used a web-scraping algorithm and collected a total of 18,992 property listings in the city of Vilnius during the first wave of the COVID-19 pandemic.
Abstract: As the COVID-19 pandemic came unexpectedly, many real estate experts claimed that the property values would fall like the 2007 crash. However, this study raises the question of what attributes of an apartment are most likely to influence a price revision during the pandemic. The findings in prior studies have lacked consensus, especially regarding the time-on-the-market variable, which exhibits an omnidirectional effect. However, with the rise of Big Data, this study used a web-scraping algorithm and collected a total of 18,992 property listings in the city of Vilnius during the first wave of the COVID-19 pandemic. Afterwards, 15 different machine learning models were applied to forecast apartment revisions, and the SHAP values for interpretability were used. The findings in this study coincide with the previous literature results, affirming that real estate is quite resilient to pandemics, as the price drops were not as dramatic as first believed. Out of the 15 different models tested, extreme gradient boosting was the most accurate, although the difference was negligible. The retrieved SHAP values conclude that the time-on-the-market variable was by far the most dominant and consistent variable for price revision forecasting. Additionally, the time-on-the-market variable exhibited an inverse U-shaped behaviour.
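
The modelling step, gradient boosting plus SHAP attributions, can be sketched as follows. The listing features, the synthetic target, and the XGBoost parameters are invented for illustration and do not reflect the scraped Vilnius data or the paper's results.

```python
import numpy as np
import pandas as pd
import shap
import xgboost as xgb

# Hypothetical listing features; the paper's real data came from scraped Vilnius listings
rng = np.random.default_rng(0)
listings = pd.DataFrame({
    "time_on_market_days": rng.integers(1, 365, 2000),
    "area_sqm": rng.uniform(20, 120, 2000),
    "rooms": rng.integers(1, 5, 2000),
    "floor": rng.integers(1, 20, 2000),
})
# Synthetic target for illustration only
price_revision = -0.02 * listings["time_on_market_days"] + rng.normal(0, 1, 2000)

model = xgb.XGBRegressor(n_estimators=200, max_depth=4).fit(listings, price_revision)

# SHAP values quantify each feature's contribution to individual predictions
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(listings)
print(pd.Series(np.abs(shap_values).mean(axis=0), index=listings.columns)
      .sort_values(ascending=False))
```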

Journal ArticleDOI
TL;DR: A new similarity algorithm is proposed, so-called User Profile Correlation-based Similarity (UPCSim), that examines the genre data and the user profile data, namely age, gender, occupation, and location, and outperforms the previous algorithm on recommendation accuracy.
Abstract: Collaborative filtering is one of the most widely used recommendation system approaches. One issue in collaborative filtering is how to use a similarity algorithm to increase the accuracy of the recommendation system. Most recently, a similarity algorithm that combines the user rating value and the user behavior value has been proposed. The user behavior value is obtained from the user score probability in assessing the genre data. The problem with the algorithm is it only considers genre data for capturing user behavior value. Therefore, this study proposes a new similarity algorithm – so-called User Profile Correlation-based Similarity (UPCSim) – that examines the genre data and the user profile data, namely age, gender, occupation, and location. All the user profile data are used to find the weights of the similarities of user rating value and user behavior value. The weights of both similarities are obtained by calculating the correlation coefficients between the user profile data and the user rating or behavior values. An experiment shows that the UPCSim algorithm outperforms the previous algorithm on recommendation accuracy, reducing MAE by 1.64% and RMSE by 1.4%.

Journal ArticleDOI
TL;DR: The binary classification approach proposed here, which considers label noise instances as anomalies, uniquely uses reconstruction errors on noisy data in order to identify and filter label noise.
Abstract: Label noise is an important data quality issue that negatively impacts machine learning algorithms. For example, label noise has been shown to increase the number of instances required to train effective predictive models. It has also been shown to increase model complexity and decrease model interpretability. In addition, label noise can cause the classification results of a learner to be poor. In this paper, we detect label noise with three unsupervised learners, namely principal component analysis (PCA), independent component analysis (ICA), and autoencoders. We evaluate these three learners on a credit card fraud dataset using multiple noise levels, and then compare results to the traditional Tomek links noise filter. Our binary classification approach, which considers label noise instances as anomalies, uniquely uses reconstruction errors for noisy data in order to identify and filter label noise. For detecting noisy instances, we discovered that the autoencoder algorithm was the top performer (highest recall score of 0.90), while Tomek links performed the worst (highest recall score of 0.62).
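
A minimal sketch of the reconstruction-error idea with a Keras autoencoder: train it on the feature vectors, then flag the instances it reconstructs worst as label-noise candidates. The synthetic data, the 95th-percentile threshold, and the architecture are assumptions, not the paper's configuration.

```python
import numpy as np
from tensorflow.keras import layers, models

# Synthetic stand-in for transaction features with low-dimensional structure
rng = np.random.default_rng(0)
latent = rng.normal(size=(5000, 5))
X = (latent @ rng.normal(size=(5, 30))
     + 0.1 * rng.normal(size=(5000, 30))).astype("float32")

# Train an autoencoder and use reconstruction error as an anomaly score
autoencoder = models.Sequential([
    layers.Dense(16, activation="relu", input_shape=(30,)),
    layers.Dense(8, activation="relu"),      # bottleneck
    layers.Dense(16, activation="relu"),
    layers.Dense(30, activation="linear"),
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=20, batch_size=128, verbose=0)

reconstruction = autoencoder.predict(X, verbose=0)
errors = np.mean((X - reconstruction) ** 2, axis=1)

# Flag the instances with the largest reconstruction errors as likely label noise
threshold = np.percentile(errors, 95)
noisy_candidates = np.flatnonzero(errors > threshold)
print(len(noisy_candidates), "instances flagged for filtering")
```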

Journal ArticleDOI
TL;DR: In this paper, an accurate model for classifying sleep stages by features of Heart Rate Variability (HRV) extracted from Electrocardiogram (ECG) was developed to predict the sleep stages proportion.
Abstract: Recent developments in portable sensor devices, cloud computing, and machine learning algorithms have led to the emergence of big data analytics in healthcare. The condition of the human body, e.g. the ECG signal, can be monitored regularly by means of a portable sensor device. The use of a machine learning algorithm would then provide an overview of a patient's current health on a regular basis, compared to a medical doctor's diagnosis that can only be made during a hospital visit. This work aimed to develop an accurate model for classifying sleep stages by features of Heart Rate Variability (HRV) extracted from the Electrocardiogram (ECG). The sleep stage classification can be utilized to predict the sleep stage proportions, and sleep stage proportion information can provide an insight into human sleep quality. The integration of the Extreme Learning Machine (ELM) and Particle Swarm Optimization (PSO) was utilized for selecting features and determining the number of hidden nodes. The results were compared to the Support Vector Machine (SVM) and plain ELM methods, both of which scored lower than the integration of ELM with PSO. The accuracies of the combined ELM and PSO were 62.66%, 71.52%, 76.77%, and 82.1% for 6, 4, 3, and 2 classes, respectively. To sum up, the classification accuracy can be improved by deploying the PSO algorithm for feature selection.
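
For context, the core of an Extreme Learning Machine is small enough to sketch in NumPy: a random hidden layer followed by a least-squares solve for the output weights. The digits dataset and the 300 hidden nodes are placeholders, and the PSO wrapper for feature selection and hidden-node count used in the paper is omitted.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# A bare-bones Extreme Learning Machine: random hidden layer, analytic output weights
class ELM:
    def __init__(self, n_hidden=200, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y):
        classes = np.unique(y)
        T = (y[:, None] == classes).astype(float)        # one-hot targets
        self.classes_ = classes
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = np.tanh(X @ self.W + self.b)                  # random hidden features
        self.beta = np.linalg.pinv(H) @ T                 # least-squares output weights
        return self

    def predict(self, X):
        H = np.tanh(X @ self.W + self.b)
        return self.classes_[np.argmax(H @ self.beta, axis=1)]

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X / 16.0, y, random_state=0)
print("accuracy:", np.mean(ELM(n_hidden=300).fit(X_tr, y_tr).predict(X_te) == y_te))
```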

Journal ArticleDOI
TL;DR: In this paper, the authors used daily mobility data aggregated at the county level, together with COVID-19 statistics and demographic information, for short-term forecasting of COVID-19 outbreaks in the United States.
Abstract: The early detection of the coronavirus disease 2019 (COVID-19) outbreak is important to save people’s lives and restart the economy quickly and safely. People’s social behavior, reflected in their mobility data, plays a major role in spreading the disease. Therefore, we used the daily mobility data aggregated at the county level beside COVID-19 statistics and demographic information for short-term forecasting of COVID-19 outbreaks in the United States. The daily data are fed to a deep learning model based on Long Short-Term Memory (LSTM) to predict the accumulated number of COVID-19 cases in the next two weeks. A significant average correlation was achieved (r=0.83 (p = 0.005)) between the model predicted and actual accumulated cases in the interval from August 1, 2020 until January 22, 2021. The model predictions had r > 0.7 for 87% of the counties across the United States. A lower correlation was reported for the counties with total cases of <1000 during the test interval. The average mean absolute error (MAE) was 605.4 and decreased with a decrease in the total number of cases during the testing interval. The model was able to capture the effect of government responses on COVID-19 cases. Also, it was able to capture the effect of age demographics on the COVID-19 spread. It showed that the average daily cases decreased with a decrease in the retiree percentage and increased with an increase in the young percentage. Lessons learned from this study not only can help with managing the COVID-19 pandemic but also can help with early and effective management of possible future pandemics. The code used for this study was made publicly available on https://github.com/Murtadha44/covid-19-spread-risk.

Journal ArticleDOI
TL;DR: This paper combines the X-Means algorithm, the ensemble learning system, and the N-List structure to analyze the customer portfolio of a mobile telecommunication company and provide value-added services.
Abstract: Value-Added Services at a mobile telecommunication company provide customers with a variety of services. Value-added services generate significant revenue annually for telecommunication companies. Providing solutions that can supply customers of a telecommunication company with relevant and engaging services has become a major challenge in this field. Numerous methods have been proposed so far to analyze the customer basket and provide related services. Although these methods have many applications, they still face difficulties in improving the accuracy of offers. This paper combines the X-Means algorithm, an ensemble learning system, and the N-List structure to analyze the customer portfolio of a mobile telecommunication company and provide value-added services. The X-Means algorithm is used to determine the optimal number of clusters and the clustering of customers in a mobile telecommunication company. The ensemble learning algorithm is also used to assign categories to new customers, and finally the N-List structure is used for customer basket analysis. By simulating the proposed method and comparing it with other methods including KNN, SVM, and deep neural networks, the accuracy improved by about 7%.
