
Showing papers in "Journal of Big Data in 2020"


Journal ArticleDOI
TL;DR: A fake news data repository FakeNewsNet is presented, which contains two comprehensive data sets with diverse features in news content, social context, and spatiotemporal information, and is discussed for potential applications on fake news study on social media.
Abstract: Social media has become a popular means for people to consume and share the news. At the same time, however, it has also enabled the wide dissemination of fake news, that is, news with intentionally false information, causing significant negative effects on society. To mitigate this problem, the research of fake news detection has recently received a lot of attention. Despite several existing computational solutions for the detection of fake news, the lack of comprehensive and community-driven fake news data sets has become one of the major roadblocks. Not only are existing data sets scarce, they also lack many of the features often required in such studies, such as news content, social context, and spatiotemporal information. Therefore, in this article, to facilitate fake news-related research, we present a fake news data repository, FakeNewsNet, which contains two comprehensive data sets with diverse features in news content, social context, and spatiotemporal information. We present a comprehensive description of FakeNewsNet, demonstrate an exploratory analysis of the two data sets from different perspectives, and discuss the benefits of FakeNewsNet for potential applications in fake news studies on social media.

577 citations


Journal ArticleDOI
TL;DR: This paper adopts Random Forest to select the important features in classification and compares the results on the datasets with and without essential feature selection by the RF methods varImp(), Boruta, and Recursive Feature Elimination to get the best percentage accuracy and kappa.
Abstract: Feature selection becomes prominent, especially in data sets with many variables and features. It eliminates unimportant variables and improves the accuracy as well as the performance of classification. Random Forest has emerged as a quite useful algorithm that can handle the feature selection issue even with a higher number of variables. In this paper, we use three popular datasets with a higher number of variables (Bank Marketing, Car Evaluation Database, Human Activity Recognition Using Smartphones) to conduct the experiment. There are four main reasons why feature selection is essential: first, to simplify the model by reducing the number of parameters; next, to decrease the training time; to reduce overfitting by enhancing generalization; and to avoid the curse of dimensionality. Besides, we evaluate and compare the accuracy and performance of each classification model, such as Random Forest (RF), Support Vector Machines (SVM), K-Nearest Neighbors (KNN), and Linear Discriminant Analysis (LDA). The model with the highest accuracy is the best classifier. Practically, this paper adopts Random Forest to select the important features in classification. Our experiments clearly show a comparative study of the RF algorithm from different perspectives. Furthermore, we compare the results on the datasets with and without essential feature selection by the RF methods varImp(), Boruta, and Recursive Feature Elimination (RFE) to get the best percentage accuracy and kappa. Experimental results demonstrate that Random Forest achieves better performance in all experiment groups.
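The paper works with R tools (varImp(), Boruta, RFE); the sketch below is a rough Python/scikit-learn analogue of the comparison it describes, using synthetic data and an arbitrary cut-off of 10 features as illustrative assumptions rather than the authors' setup.

```python
# Compare classification accuracy with and without Random Forest-driven feature selection.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=40, n_informative=8, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)

# Baseline: accuracy with all features.
baseline = cross_val_score(rf, X, y, cv=5).mean()

# Recursive Feature Elimination driven by Random Forest importances.
rfe = RFE(estimator=rf, n_features_to_select=10)
X_reduced = rfe.fit_transform(X, y)
reduced = cross_val_score(rf, X_reduced, y, cv=5).mean()

print(f"all features: {baseline:.3f}  selected features: {reduced:.3f}")
```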

271 citations


Journal ArticleDOI
TL;DR: This survey takes an interdisciplinary approach to cover studies related to CatBoost in a single work, and provides researchers with an in-depth understanding to help clarify proper application of CatBoost in solving problems.
Abstract: Gradient Boosted Decision Trees (GBDTs) are a powerful tool for classification and regression tasks in Big Data. Researchers should be familiar with the strengths and weaknesses of current implementations of GBDTs in order to use them effectively and make successful contributions. CatBoost is a member of the family of GBDT machine learning ensemble techniques. Since its debut in late 2018, researchers have successfully used CatBoost for machine learning studies involving Big Data. We take this opportunity to review recent research on CatBoost as it relates to Big Data, and learn best practices from studies that cast CatBoost in a positive light, as well as studies where CatBoost does not outshine other techniques, since we can learn lessons from both types of scenarios. Furthermore, as a Decision Tree based algorithm, CatBoost is well-suited to machine learning tasks involving categorical, heterogeneous data. Recent work across multiple disciplines illustrates CatBoost’s effectiveness and shortcomings in classification and regression tasks. Another important issue we expose in the literature on CatBoost is its sensitivity to hyper-parameters and the importance of hyper-parameter tuning. One contribution we make is to take an interdisciplinary approach to cover studies related to CatBoost in a single work. This provides researchers with an in-depth understanding to help clarify proper application of CatBoost in solving problems. To the best of our knowledge, this is the first survey that studies all works related to CatBoost in a single publication.
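A minimal CatBoost sketch illustrating the two points the survey stresses: native handling of categorical features and sensitivity to hyper-parameters. The synthetic data, column names, and the small depth grid are illustrative assumptions, not taken from any of the surveyed studies.

```python
import pandas as pd
from catboost import CatBoostClassifier, Pool

df = pd.DataFrame({
    "city": ["NY", "LA", "NY", "SF"] * 50,        # categorical, no manual encoding needed
    "plan": ["basic", "pro", "pro", "basic"] * 50,
    "usage": range(200),
    "churn": [0, 1, 0, 1] * 50,
})
cat_features = ["city", "plan"]
train = Pool(df[["city", "plan", "usage"]], label=df["churn"], cat_features=cat_features)

# Hyper-parameter tuning matters: depth/learning_rate defaults are rarely optimal.
for depth in (4, 6, 8):
    model = CatBoostClassifier(iterations=200, depth=depth, learning_rate=0.1, verbose=0)
    model.fit(train)
    print(depth, model.get_best_score())
```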

247 citations


Journal ArticleDOI
TL;DR: This paper focuses on and briefly discusses cybersecurity data science, where the data is gathered from relevant cybersecurity sources, and the analytics complement the latest data-driven patterns for providing more effective security solutions.
Abstract: In a computing context, cybersecurity is undergoing massive shifts in technology and its operations in recent days, and data science is driving the change. Extracting security incident patterns or insights from cybersecurity data and building a corresponding data-driven model is the key to making a security system automated and intelligent. To understand and analyze the actual phenomena with data, various scientific methods, machine learning techniques, processes, and systems are used, which is commonly known as data science. In this paper, we focus on and briefly discuss cybersecurity data science, where the data is gathered from relevant cybersecurity sources, and the analytics complement the latest data-driven patterns for providing more effective security solutions. The concept of cybersecurity data science makes the computing process more actionable and intelligent compared to traditional approaches in the domain of cybersecurity. We then discuss and summarize a number of associated research issues and future directions. Furthermore, we provide a machine learning based multi-layered framework for the purpose of cybersecurity modeling. Overall, our goal is not only to discuss cybersecurity data science and relevant methods but also to highlight its applicability towards data-driven intelligent decision making for protecting systems from cyber-attacks.

240 citations


Journal ArticleDOI
TL;DR: This study provides a starting point for research in determining which techniques for preparing qualitative data for use with neural networks are best, and is the first in-depth look at techniques for working with categorical data in neural networks.
Abstract: This survey investigates current techniques for representing qualitative data for use as input to neural networks. Techniques for using qualitative data in neural networks are well known. However, researchers continue to discover new variations or entirely new methods for working with categorical data in neural networks. Our primary contribution is to cover these representation techniques in a single work. Practitioners working with big data often have a need to encode categorical values in their datasets in order to leverage machine learning algorithms. Moreover, the size of data sets we consider as big data may cause one to reject some encoding techniques as impractical, due to their running time complexity. Neural networks take vectors of real numbers as inputs. One must use a technique to map qualitative values to numerical values before using them as input to a neural network. These techniques are known as embeddings, encodings, representations, or distributed representations. Another contribution this work makes is to provide references for the source code of various techniques, where we are able to verify the authenticity of the source code. We cover recent research in several domains where researchers use categorical data in neural networks. Some of these domains are natural language processing, fraud detection, and clinical document automation. This study provides a starting point for research in determining which techniques for preparing qualitative data for use with neural networks are best. It is our intention that the reader should use these implementations as a starting point to design experiments to evaluate various techniques for working with qualitative data in neural networks. The third contribution we make in this work is a new perspective on techniques for using categorical data in neural networks. We organize techniques for using categorical data in neural networks into three categories. We find three distinct patterns in techniques that identify a technique as determined, algorithmic, or automated. The fourth contribution we make is to identify several opportunities for future research. The form of the data that one uses as an input to a neural network is crucial for using neural networks effectively. This work is a tool for researchers to find the most effective technique for working with categorical data in neural networks, in big data settings. To the best of our knowledge this is the first in-depth look at techniques for working with categorical data in neural networks.
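A small sketch contrasting two of the representation families the survey organizes: a "determined" encoding (one-hot, fixed before training) and an "automated" one (an embedding learned with the network). The category values and embedding size are illustrative assumptions.

```python
import pandas as pd
import tensorflow as tf

colors = ["red", "green", "blue", "green"]

# Determined: one-hot encoding, the mapping is fixed before training.
onehot = pd.get_dummies(pd.Series(colors))
print(onehot.shape)  # (4, 3)

# Automated: an embedding learned jointly with the network's weights.
vocab = {"red": 0, "green": 1, "blue": 2}
embedding = tf.keras.layers.Embedding(input_dim=len(vocab), output_dim=8)
ids = tf.constant([vocab[c] for c in colors])
dense = embedding(ids)  # shape (4, 8), trainable parameters
print(dense.shape)
```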

217 citations


Journal ArticleDOI
TL;DR: This survey aims to document the state of anomaly detection in high dimensional big data by representing the unique challenges using a triangular model of vertices: the problem, techniques/algorithms, and tools (big data applications/frameworks).
Abstract: Anomaly detection in high dimensional data is becoming a fundamental research problem that has various applications in the real world. However, many existing anomaly detection techniques fail to retain sufficient accuracy due to so-called “big data”, characterised by high-volume and high-velocity data generated by a variety of sources. This phenomenon of having both problems together can be referred to as the “curse of big dimensionality,” which affects existing techniques in terms of both performance and accuracy. To address this gap and to understand the core problem, it is necessary to identify the unique challenges brought by anomaly detection with both high dimensionality and big data problems. Hence, this survey aims to document the state of anomaly detection in high dimensional big data by representing the unique challenges using a triangular model of vertices: the problem (big dimensionality), techniques/algorithms (anomaly detection), and tools (big data applications/frameworks). Works that fall directly into any of the vertices, or are closely related to them, are taken into consideration for review. Furthermore, the limitations of traditional approaches and current strategies for high dimensional data are discussed along with recent techniques and applications on big data required for the optimization of anomaly detection.
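The survey covers families of techniques rather than a single algorithm; the sketch below shows one representative combination in the spirit of what it discusses, reducing dimensionality before running an unsupervised detector. The data, component count, and contamination rate are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(size=(5000, 300))            # high-dimensional "normal" records
outliers = rng.normal(loc=6.0, size=(50, 300))   # injected anomalies
X = np.vstack([normal, outliers])

X_low = PCA(n_components=20).fit_transform(X)    # mitigate the curse of dimensionality
labels = IsolationForest(contamination=0.01, random_state=0).fit_predict(X_low)
print((labels == -1).sum(), "points flagged as anomalous")
```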

159 citations


Journal ArticleDOI
TL;DR: An analysis of the UNSW-NB15 intrusion detection dataset is presented and a filter-based feature reduction technique using the XGBoost algorithm is applied that allows methods such as the DT to increase their test accuracy from 88.13% to 90.85% for the binary classification scheme.
Abstract: Computer network intrusion detection systems (IDSs) and intrusion prevention systems (IPSs) are critical aspects that contribute to the success of an organization. Over the past years, IDSs and IPSs using different approaches have been developed and implemented to ensure that computer networks within enterprises are secure, reliable and available. In this paper, we focus on IDSs that are built using machine learning (ML) techniques. IDSs based on ML methods are effective and accurate in detecting network attacks. However, the performance of these systems decreases for high dimensional data spaces. Therefore, it is crucial to implement an appropriate feature extraction method that can prune some of the features that do not have a great impact on the classification process. Moreover, many ML based IDSs suffer from an increase in false positive rate and a low detection accuracy when the models are trained on highly imbalanced datasets. In this paper, we present an analysis of the UNSW-NB15 intrusion detection dataset that will be used for training and testing our models. Moreover, we apply a filter-based feature reduction technique using the XGBoost algorithm. We then implement the following ML approaches using the reduced feature space: Support Vector Machine (SVM), k-Nearest-Neighbour (kNN), Logistic Regression (LR), Artificial Neural Network (ANN) and Decision Tree (DT). In our experiments, we considered both the binary and multiclass classification configurations. The results demonstrated that the XGBoost-based feature selection method allows methods such as the DT to increase their test accuracy from 88.13% to 90.85% for the binary classification scheme.
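A sketch of the filter-based pipeline the abstract describes: rank features with XGBoost, keep the strongest ones, then train a Decision Tree on the reduced space. The file name, the numeric-only column handling, and the cut-off of 20 features are assumptions for illustration, not the authors' exact configuration.

```python
import numpy as np
import pandas as pd
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("UNSW_NB15_training-set.csv")        # hypothetical local copy of the dataset
X = df.drop(columns=["label"]).select_dtypes("number")
y = df["label"]                                       # binary configuration

# Filter step: rank features by XGBoost importance and keep the top 20.
ranker = XGBClassifier(n_estimators=100).fit(X, y)
top = X.columns[np.argsort(ranker.feature_importances_)[::-1][:20]]

# Train a Decision Tree on the reduced feature space.
X_tr, X_te, y_tr, y_te = train_test_split(X[top], y, test_size=0.3, random_state=0)
dt = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print("DT accuracy on reduced features:", accuracy_score(y_te, dt.predict(X_te)))
```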

159 citations


Journal ArticleDOI
TL;DR: This study proposes a transfer learning based approach tackling the aforementioned shortcomings of existing ABSA methods and proposes an advanced sentiment analysis method, namely Aspect Enhanced Sentiment Analysis (AESA) to classify text into sentiment classes with consideration of the entity aspects.
Abstract: Sentiment analysis is recognized as one of the most important sub-areas in Natural Language Processing (NLP) research, where understanding implicit or explicit sentiments expressed in social media contents is valuable to customers, business owners, and other stakeholders. Researchers have recognized that the generic sentiments extracted from the textual contents are inadequate, thus, Aspect Based Sentiment Analysis (ABSA) was coined to capture aspect sentiments expressed toward specific review aspects. Existing ABSA methods not only treat the analytical problem as single-label classification that requires a fairly large amount of labelled data for model training purposes, but also underestimate the entity aspects that are independent of certain sentiments. In this study, we propose a transfer learning based approach tackling the aforementioned shortcomings of existing ABSA methods. Firstly, the proposed approach extends the ABSA methods with multi-label classification capabilities. Secondly, we propose an advanced sentiment analysis method, namely Aspect Enhanced Sentiment Analysis (AESA), to classify text into sentiment classes with consideration of the entity aspects. Thirdly, we extend two state-of-the-art transfer learning models as the analytical vehicles of the multi-label ABSA and AESA tasks. We design an experiment that includes data from different domains to extensively evaluate the proposed approach. The empirical results undoubtedly exhibit that the proposed approach outperforms all the baseline approaches.

150 citations


Journal ArticleDOI
TL;DR: This work collected 2 years of data from the Chinese stock market and proposed a comprehensive customization of feature engineering and a deep learning-based model for predicting the price trend of stock markets, which achieves overall high accuracy for stock market trend prediction.
Abstract: In the era of big data, deep learning for predicting stock market prices and trends has become even more popular than before. We collected 2 years of data from the Chinese stock market and proposed a comprehensive customization of feature engineering and a deep learning-based model for predicting the price trend of stock markets. The proposed solution is comprehensive as it includes pre-processing of the stock market dataset, utilization of multiple feature engineering techniques, combined with a customized deep learning based system for stock market price trend prediction. We conducted comprehensive evaluations of frequently used machine learning models and conclude that our proposed solution outperforms them due to the comprehensive feature engineering that we built. The system achieves overall high accuracy for stock market trend prediction. With the detailed design and evaluation of prediction term lengths, feature engineering, and data pre-processing methods, this work contributes to the stock analysis research community in both the financial and technical domains.
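A compressed sketch of the kind of pipeline the abstract outlines: engineer lagged/rolling features from price data, then feed fixed-length windows to a small deep model that predicts the next-day trend. The file name, column names, window length, and model size are illustrative assumptions, not the authors' design.

```python
import numpy as np
import pandas as pd
import tensorflow as tf

df = pd.read_csv("stock_prices.csv")                 # hypothetical file with a 'close' column
df["return"] = df["close"].pct_change()              # simple engineered features
df["ma_5"] = df["close"].rolling(5).mean()
df["trend"] = (df["close"].shift(-1) > df["close"]).astype(int)  # next-day up/down label
df = df.dropna()

window, feats = 30, ["return", "ma_5"]
X = np.stack([df[feats].values[i - window:i] for i in range(window, len(df))])
y = df["trend"].values[window:]

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(window, len(feats))),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=64, validation_split=0.2)
```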

128 citations


Journal ArticleDOI
TL;DR: An extensive comparative analysis of ensemble techniques such as boosting, bagging, blending and super learners (stacking) suggests that an innovative study in the domain of stock market direction prediction ought to include ensemble techniques in their sets of algorithms.
Abstract: Stock-market prediction using machine-learning techniques aims at developing effective and efficient models that can provide a better and higher rate of prediction accuracy. Numerous ensemble regressors and classifiers have been applied in stock market predictions, using different combination techniques. However, three precarious issues come to mind when constructing ensemble classifiers and regressors. The first concerns the choice of base regressor or classifier technique adopted. The second concerns the combination techniques used to assemble multiple regressors or classifiers, and the third concerns the number of regressors or classifiers to be ensembled. Subsequently, the number of relevant studies scrutinising these previously mentioned concerns is limited. In this study, we performed an extensive comparative analysis of ensemble techniques such as boosting, bagging, blending and super learners (stacking). Using Decision Trees (DT), Support Vector Machine (SVM) and Neural Network (NN), we constructed twenty-five (25) different ensembled regressors and classifiers. We compared their execution times, accuracy, and error metrics over stock data from the Ghana Stock Exchange (GSE), Johannesburg Stock Exchange (JSE), Bombay Stock Exchange (BSE-SENSEX) and New York Stock Exchange (NYSE), from January 2012 to December 2018. The study outcome shows that the stacking and blending ensemble techniques offer higher prediction accuracies (90–100%) and (85.7–100%) respectively, compared with bagging (53–97.78%) and boosting (52.7–96.32%). Furthermore, the root mean square error (RMSE) recorded by stacking (0.0001–0.001) and blending (0.002–0.01) shows a better fit of ensemble classifiers and regressors based on these two techniques in market analyses compared with bagging (0.01–0.11) and boosting (0.01–0.443). Finally, the results undoubtedly suggest that an innovative study in the domain of stock market direction prediction ought to include ensemble techniques in its set of algorithms.
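A sketch of one of the ensemble configurations the study compares: a stacking classifier over the three base learners it names (DT, SVM, NN) with a logistic-regression meta-learner. The synthetic stand-in data and hyper-parameters are assumptions; the study's own data come from the GSE, JSE, BSE-SENSEX and NYSE.

```python
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)  # stand-in for market features

stack = StackingClassifier(
    estimators=[
        ("dt", DecisionTreeClassifier(max_depth=5, random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
        ("nn", MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)),
    ],
    final_estimator=LogisticRegression(),   # meta-learner combining base predictions
    cv=5,
)
print("stacking accuracy:", cross_val_score(stack, X, y, cv=5).mean())
```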

120 citations


Journal ArticleDOI
TL;DR: The experimental studies show that the CatBoost and LogitBoost algorithms are superior to other boosting algorithms on multi-class imbalanced conventional and big datasets, respectively, and the MMCC is a better evaluation metric than the MAUC and G-mean in multi-class imbalanced data domains.
Abstract: Since canonical machine learning algorithms assume that the dataset has an equal number of samples in each class, binary classification becomes a very challenging task when trying to discriminate the minority class samples efficiently in imbalanced datasets. For this reason, researchers have paid attention to this problem and have proposed many methods to deal with it, which can be broadly categorized into the data level and the algorithm level. Besides, multi-class imbalanced learning is much harder than the binary case and is still an open problem. Boosting algorithms are a class of ensemble learning methods in machine learning that improve the performance of separate base learners by combining them into a composite whole. This paper’s aim is to review the most significant published boosting techniques on multi-class imbalanced datasets. A thorough empirical comparison is conducted to analyze the performance of binary and multi-class boosting algorithms on various multi-class imbalanced datasets. In addition, based on the obtained results for performance evaluation metrics and recently proposed criteria for comparing metrics, the selected metrics are compared to determine a suitable performance metric for multi-class imbalanced datasets. The experimental studies show that the CatBoost and LogitBoost algorithms are superior to other boosting algorithms on multi-class imbalanced conventional and big datasets, respectively. Furthermore, the MMCC is a better evaluation metric than the MAUC and G-mean in multi-class imbalanced data domains.
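A sketch of the evaluation setting the paper studies: a boosting classifier trained on a multi-class imbalanced dataset and scored with the multi-class Matthews correlation coefficient (scikit-learn's matthews_corrcoef generalizes to multi-class and is used here as a stand-in for the MMCC named above). The synthetic data, class weights, and CatBoost settings are illustrative assumptions.

```python
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import matthews_corrcoef

# Four classes with a strong imbalance (70/20/7/3 percent).
X, y = make_classification(n_samples=5000, n_classes=4, n_informative=10,
                           weights=[0.7, 0.2, 0.07, 0.03], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = CatBoostClassifier(iterations=300, loss_function="MultiClass", verbose=0)
model.fit(X_tr, y_tr)

print("multi-class MCC:", matthews_corrcoef(y_te, model.predict(X_te).ravel()))
```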

Journal ArticleDOI
TL;DR: This survey investigates the predictive BDA applications in supply chain demand forecasting to propose a classification of these applications, identify the gaps, and provide insights for future research.
Abstract: Big data analytics (BDA) in supply chain management (SCM) is receiving growing attention. This is due to the fact that BDA has a wide range of applications in SCM, including customer behavior analysis, trend analysis, and demand prediction. In this survey, we investigate the predictive BDA applications in supply chain demand forecasting to propose a classification of these applications, identify the gaps, and provide insights for future research. We classify these algorithms and their applications in supply chain management into time-series forecasting, clustering, K-nearest-neighbors, neural networks, regression analysis, support vector machines, and support vector regression. This survey also points to the fact that the literature is particularly lacking on the applications of BDA for demand forecasting in the case of closed-loop supply chains (CLSCs) and accordingly highlights avenues for future research.

Journal ArticleDOI
TL;DR: Experimental results show that the proposed algorithm achieves 20% higher F-measure for binary data imputation and 11% less error for numeric data imputations than its competitors with similar execution time.
Abstract: In data analytics, missing data is a factor that degrades performance. Incorrect imputation of missing values could lead to a wrong prediction. In this era of big data, when a massive volume of data is generated every second and utilization of these data is a major concern to the stakeholders, efficiently handling missing values becomes more important. In this paper, we have proposed a new technique for missing data imputation, which is a hybrid approach of single and multiple imputation techniques. We have proposed an extension of the popular Multivariate Imputation by Chained Equation (MICE) algorithm in two variations to impute categorical and numeric data. We have also implemented twelve existing algorithms to impute binary, ordinal, and numeric missing values. We have collected sixty-five thousand real health records from different hospitals and diagnostic centers of Bangladesh, maintaining the privacy of data. We have also collected three public datasets from the UCI Machine Learning Repository, ETH Zurich, and Kaggle. We have compared the performance of our proposed algorithms with existing algorithms using these datasets. Experimental results show that our proposed algorithm achieves a 20% higher F-measure for binary data imputation and 11% less error for numeric data imputation than its competitors with similar execution time.
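The paper extends MICE; the sketch below shows only the base chained-equation imputation it builds on, via scikit-learn's IterativeImputer, as an illustration rather than the authors' extended algorithm. The toy matrix is an assumption.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (enables IterativeImputer)
from sklearn.impute import IterativeImputer

# Toy numeric records with scattered missing values.
X = np.array([
    [25.0, 120.0, np.nan],
    [31.0, np.nan, 80.0],
    [np.nan, 135.0, 95.0],
    [47.0, 150.0, 88.0],
])

# Each feature with missing values is modelled as a function of the others,
# iterating until the imputations stabilize (the chained-equations idea behind MICE).
imputer = IterativeImputer(max_iter=10, random_state=0)
print(imputer.fit_transform(X))
```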

Journal ArticleDOI
TL;DR: It is determined that the best performance scores for each study were unexpectedly high overall, which may be due to overfitting, and that information on the data cleaning of CSE-CIC-IDS2018 was inadequate across the board, a finding that may indicate problems with reproducibility of experiments.
Abstract: The exponential growth in computer networks and network applications worldwide has been matched by a surge in cyberattacks. For this reason, datasets such as CSE-CIC-IDS2018 were created to train predictive models on network-based intrusion detection. These datasets are not meant to serve as repositories for signature-based detection systems, but rather to promote research on anomaly-based detection through various machine learning approaches. CSE-CIC-IDS2018 contains about 16,000,000 instances collected over the course of ten days. It is the most recent intrusion detection dataset that is big data, publicly available, and covers a wide range of attack types. This multi-class dataset has a class imbalance, with roughly 17% of the instances comprising attack (anomalous) traffic. Our survey work contributes several key findings. We determined that the best performance scores for each study, where available, were unexpectedly high overall, which may be due to overfitting. We also found that most of the works did not address class imbalance, the effects of which can bias results in a big data study. Lastly, we discovered that information on the data cleaning of CSE-CIC-IDS2018 was inadequate across the board, a finding that may indicate problems with reproducibility of experiments. In our survey, major research gaps have also been identified.

Journal ArticleDOI
TL;DR: Through this systematic review, it is found that the methods proposed to solve physical sensor data errors cannot be directly compared due to the non-uniform evaluation process and the high use of non-publicly available datasets.
Abstract: Sensor data quality plays a vital role in Internet of Things (IoT) applications as they are rendered useless if the data quality is bad. This systematic review aims to provide an introduction and guide for researchers who are interested in quality-related issues of physical sensor data. The process and results of the systematic review are presented, which aims to answer the following research questions: what are the different types of physical sensor data errors, how to quantify or detect those errors, how to correct them, and what domains are the solutions in. Out of 6970 publications obtained from three databases (ACM Digital Library, IEEE Xplore and ScienceDirect) using a search string refined via topic modelling, 57 publications were selected and examined. Results show that the different types of sensor data errors addressed by those papers are mostly missing data and faults, e.g. outliers, bias and drift. The most common solutions for error detection are based on principal component analysis (PCA) and artificial neural networks (ANN), which account for about 40% of all error detection papers found in the study. Similarly, for fault correction, PCA and ANN are among the most common, along with Bayesian Networks. Missing values, on the other hand, are mostly imputed using Association Rule Mining. Other techniques include hybrid solutions that combine several data science methods to detect and correct the errors. Through this systematic review, it is found that the methods proposed to solve physical sensor data errors cannot be directly compared due to the non-uniform evaluation process and the high use of non-publicly available datasets. Bayesian data analysis done on the 57 selected publications also suggests that publications using publicly available datasets for method evaluation have higher citation rates.

Journal ArticleDOI
TL;DR: The objective of this paper was to show the current landscape of finance dealing with big data, and to show how big data influences different financial sectors, more specifically, its impact on financial markets, financial institutions, and the relationship with internet finance, financial management, internet credit service companies, fraud detection, risk analysis, financial application management, and so on.
Abstract: Big data is one of the most recent business and technical issues in the age of technology. Hundreds of millions of events occur every day. The financial field is deeply involved in the calculation of big data events. As a result, hundreds of millions of financial transactions occur in the financial world each day. Therefore, financial practitioners and analysts consider it an emerging issue of the data management and analytics of different financial products and services. Also, big data has significant impacts on financial products and services. Therefore, identifying the financial issues where big data has a significant influence is also an important issue to explore with the influences. Based on these concepts, the objective of this paper was to show the current landscape of finance dealing with big data, and also to show how big data influences different financial sectors, more specifically, its impact on financial markets, financial institutions, and the relationship with internet finance, financial management, internet credit service companies, fraud detection, risk analysis, financial application management, and so on. The connection between big data and financial-related components will be revealed in an exploratory literature review of secondary data sources. Since big data in the financial field is an extremely new concept, future research directions will be pointed out at the end of this study.

Journal ArticleDOI
TL;DR: The aim of this paper is to provide a comprehensive review of value creation, data value, and Big Data value chains with their different steps, and to construct an end-to-end exhaustive BDVC that regroup most of the addressed phases.
Abstract: The Value Chain has been considered a key model for efficiently managing value creation processes within organizations. However, with the digitization of end-to-end processes, which began to adopt data as a main source of value, traditional value chain models have become outdated. For this reason, researchers have developed new value chain models, called Data Value Chains, to support data-driven organizations. Thereafter, new data value chains called Big Data Value Chains have emerged with the rise of Big Data in order to face new data-related challenges such as high volume, velocity, and variety. These Big Data Value Chains describe the data flow within organizations which rely on Big Data to extract valuable insights. A Big Data Value Chain is a set of ordered steps using Big Data Analytics tools and mainly built for going from data generation to knowledge creation. The advances in Big Data and the Big Data Value Chain, using clear processes for aggregation and exploitation of data, have given rise to what is called data monetization. The data monetization concept consists of using data from an organization to generate profit. It may be selling the data directly for cash, or relying on that data to create value indirectly. It is important to mention that the concept of monetizing data is not as new as it looks, but with the era of Big Data and the Big Data Value Chain it is becoming attractive. The aim of this paper is to provide a comprehensive review of value creation, data value, and Big Data Value Chains with their different steps. This literature review has led us to construct an end-to-end exhaustive BDVC that regroups most of the addressed phases. Furthermore, we present a possible evolution of that generic BDVC to support Big Data Monetization. For this, we discuss different approaches that enable data monetization throughout data value chains. Finally, we highlight the need to adopt specific data monetization models to suit big data specificities.

Journal ArticleDOI
TL;DR: The results prove that different classification models must be used to identify different emotional states.
Abstract: Emotion recognition using brain signals has the potential to change the way we identify and treat some health conditions. Difficulties and limitations may arise in general emotion recognition software due to the restricted number of facial expression triggers, dissembling of emotions, or among people with alexithymia. Such triggers are identified by studying the continuous brainwaves generated by the human brain. Electroencephalogram (EEG) signals from the brain give us a more diverse insight into emotional states that one may not be able to express. Brainwave EEG signals can reflect the changes in electrical potential resulting from communication networks between neurons. This research involves analyzing the epoch data from EEG sensor channels and performing a comparative analysis of multiple machine learning techniques, namely Support Vector Machine (SVM), K-nearest neighbor, Linear Discriminant Analysis, Logistic Regression and Decision Trees; each of these models was tested with and without principal component analysis (PCA) for dimensionality reduction. Grid search was also utilized for hyper-parameter tuning for each of the tested machine learning models over a Spark cluster for lowered execution time. The DEAP Dataset was used in this study, which is a multimodal dataset for the analysis of human affective states. The predictions were based on the labels given by the participants for each of the 40 one-minute-long excerpts of music. Participants rated each video in terms of the level of arousal, valence, like/dislike, dominance and familiarity. The binary class classifiers were trained on the time-segmented, 15 s intervals of epoch data, individually for each of the 4 classes. PCA with SVM performed the best and produced an F1-score of 84.73% with 98.01% recall in the 30th to 45th interval of segmentation. For each of the time segments and binary training classes, a different classification model converges to a better accuracy and recall than others. The results prove that different classification models must be used to identify different emotional states.
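A sketch of the core pipeline described above: PCA for dimensionality reduction, an SVM binary classifier, and grid search over hyper-parameters. The DEAP loading step and the Spark-distributed grid search are omitted, and the stand-in features and parameter grid are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Stand-in for epoch features extracted from EEG channels (one binary label, e.g. high/low arousal).
X, y = make_classification(n_samples=1200, n_features=160, random_state=0)

pipe = Pipeline([("pca", PCA()), ("svm", SVC())])
grid = GridSearchCV(
    pipe,
    param_grid={"pca__n_components": [20, 50], "svm__C": [1, 10], "svm__kernel": ["rbf"]},
    scoring="f1",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```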

Journal ArticleDOI
TL;DR: This paper aims to explore dimensionality reduction on a real telecom dataset and evaluate customers’ clustering in reduced and latent space, compared to original space in order to achieve better quality clustering results.
Abstract: Telecom companies log customers’ actions, which generates a huge amount of data that can bring important findings related to customers’ behavior and needs. The main characteristics of such data are the large number of features and the high sparsity, which impose challenges on the analytics steps. This paper aims to explore dimensionality reduction on a real telecom dataset and evaluate customers’ clustering in reduced and latent space, compared to the original space, in order to achieve better quality clustering results. The original dataset contains 220 features belonging to 100,000 customers. Dimensionality reduction is an important data preprocessing step in the data mining process, especially in the presence of the curse of dimensionality. In particular, the aim of data reduction techniques is to filter out irrelevant features and noisy data samples. To reduce the high dimensional data, we projected it down to a subspace using the well known Principal Component Analysis (PCA) decomposition and a novel approach based on an Autoencoder Neural Network, performing in this way dimensionality reduction of the original data. Then K-Means Clustering is applied on both the original and the reduced data sets. Different internal measures were used to evaluate clustering for different numbers of dimensions, and then we evaluated how the reduction method impacts the clustering task.
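A compressed sketch of the comparison described above: cluster in the original space, in a PCA subspace, and in an autoencoder latent space, then compare an internal measure (silhouette). The random stand-in data, layer sizes, latent dimension, and number of clusters are illustrative assumptions, not the authors' configuration.

```python
import numpy as np
import tensorflow as tf
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

X = np.random.default_rng(0).random((2000, 220)).astype("float32")  # stand-in for the 220-feature data

def score(data, k=5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
    return silhouette_score(data, labels)

# PCA projection to a low-dimensional subspace.
X_pca = PCA(n_components=16).fit_transform(X)

# Autoencoder: train to reconstruct inputs, then keep the encoder's latent space.
encoder = tf.keras.Sequential([tf.keras.layers.Dense(64, activation="relu"),
                               tf.keras.layers.Dense(16, activation="relu")])
decoder = tf.keras.Sequential([tf.keras.layers.Dense(64, activation="relu"),
                               tf.keras.layers.Dense(220, activation="sigmoid")])
autoencoder = tf.keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=10, batch_size=256, verbose=0)
X_ae = encoder.predict(X, verbose=0)

print("original:", score(X), "PCA:", score(X_pca), "autoencoder:", score(X_ae))
```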

Journal ArticleDOI
TL;DR: The performances of the machine learning models have been improved by 20% using this ranged approach when the dataset is highly biased with random error, and it is shown how this model can be used to predict the probable backorder products before actual sales take place.
Abstract: Prediction using machine learning algorithms is not well adapted in many parts of business decision processes due to the lack of clarity and flexibility. Erroneous data as inputs in the prediction process may produce inaccurate predictions. We aim to use machine learning models in the area of the business decision process by predicting products’ backorders while providing flexibility to the decision authority, better clarity of the process, and maintaining higher accuracy. A ranged method is used for specifying different levels of predicting features to cope with the diverse characteristics of real-time data, which may be affected by machine or human errors. The range is tunable, which gives flexibility to the decision managers. Tree-based machine learning is chosen for better explainability of the model. The backorders of products are predicted in this study using Distributed Random Forest (DRF) and Gradient Boosting Machine (GBM). We have observed that the performance of the machine learning models is improved by 20% using this ranged approach when the dataset is highly biased with random error. We have utilized a five-level metric to indicate the inventory level, sales level, and forecasted sales level, and a four-level metric for the lead time. A decision tree from one of the constructed models is analyzed to understand the effects of the ranged approach. As a part of this analysis, we list major probable backorder scenarios to facilitate business decisions. We show how this model can be used to predict the probable backorder products before actual sales take place. The methods mentioned in this research can be utilized in other supply chain cases to forecast backorders.
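A sketch of the "ranged" idea described above: continuous predictors are mapped to a small number of levels (five for inventory/sales-type features, four for lead time) before a tree-based model is trained. The file name, column names, quantile-based binning, and the use of scikit-learn's GradientBoostingClassifier in place of H2O's DRF/GBM are assumptions.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("inventory.csv")                       # hypothetical file with these columns
for col in ["inventory_level", "sales", "forecast_sales"]:
    df[col + "_rng"] = pd.qcut(df[col], q=5, labels=False, duplicates="drop")   # five-level metric
df["lead_time_rng"] = pd.qcut(df["lead_time"], q=4, labels=False, duplicates="drop")  # four-level metric

features = [c for c in df.columns if c.endswith("_rng")]
X_tr, X_te, y_tr, y_te = train_test_split(df[features], df["went_on_backorder"], random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
print("test accuracy on ranged features:", model.score(X_te, y_te))
```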

Journal ArticleDOI
TL;DR: The aim of this paper is to determine domain-based social influencers by means of a framework that incorporates semantic analysis and machine learning modules to measure and predict users’ credibility in numerous domains at different time periods.
Abstract: Online social networks have established virtual platforms enabling people to express their opinions, interests and thoughts in a variety of contexts and domains, allowing legitimate users as well as spammers and other untrustworthy users to publish and spread their content. Hence, it is vital to have an accurate understanding of the contextual content of social users, thus establishing grounds for measuring their social influence accordingly. In particular, there is the need for a better understanding of domain-based social trust to improve and expand the analysis process and determining the credibility of Social Big Data. The aim of this paper is to determine domain-based social influencers by means of a framework that incorporates semantic analysis and machine learning modules to measure and predict users’ credibility in numerous domains at different time periods. The evaluation of the experiment conducted herein validates the applicability of semantic analysis and machine learning techniques in detecting highly trustworthy domain-based influencers.

Journal ArticleDOI
TL;DR: The proposed method has reduced the computational complexity of the machine learning algorithm while increasing the classification accuracy, and improves on previous related approaches with respect to the accuracy of the constrained score.
Abstract: In the past decades, the rapid growth of computer and database technologies has led to the rapid growth of large-scale datasets. On the other hand, data mining applications with high dimensional datasets that require high speed and accuracy are rapidly increasing. Semi-supervised learning is a class of machine learning in which unlabeled data and labeled data are used simultaneously to improve feature selection. The goal of feature selection over partially labeled data (semi-supervised feature selection) is to choose a subset of available features with the lowest redundancy with each other and the highest relevancy to the target class, which is the same objective as feature selection over entirely labeled data. This method uses classification to reduce ambiguity in the range of values. First, the similarity values of each pair are collected, and then these values are divided into intervals, and the average of each interval is determined. In the next step, for each interval, the number of pairs in this range is counted. Finally, by using the strength and similarity matrices, a new constraint feature selection ranking is proposed. The performance of the presented method was compared to the performance of state-of-the-art and well-known semi-supervised feature selection approaches on eight datasets. The results indicate that the proposed approach improves on previous related approaches with respect to the accuracy of the constrained score. In particular, the numerical results showed that the presented approach improved the classification accuracy by about 3% and reduced the number of selected features by 1%. Consequently, it can be said that the proposed method has reduced the computational complexity of the machine learning algorithm while increasing the classification accuracy.

Journal ArticleDOI
TL;DR: A Deep Learning Modified Neural Network (DLMNN) technique is proposed for sentiment analysis of online product reviews, and an Improved Adaptive Neuro-Fuzzy Inference System (IANFIS) technique is proposed for future prediction of online products, to overcome the above-stated issues.
Abstract: A major task in Natural Language Processing (NLP) is sentiment analysis (SA), or opinion mining (OM). To find whether a user’s attitude is positive, neutral or negative, it captures each user’s opinions, beliefs, and feelings about the corresponding product. Through this, companies can make the needed changes to the product for better customer contentment. Most existing SA techniques aimed at these online products have extremely low accuracy and also require more time for training. By employing a Deep Learning Modified Neural Network (DLMNN), a technique is proposed for SA of online product reviews; in addition, via an Improved Adaptive Neuro-Fuzzy Inference System (IANFIS), a technique is proposed for future prediction of online products, to overcome the above-stated issues. Firstly, the data values are separated from the dataset into content-based (CB), grade-based (GB), and collaboration-based (CLB) settings. Then, each setting goes through review analysis (RA) employing the DLMNN, which renders the results as negative, positive, or neutral reviews. IANFIS carries out weighting and classification on the product for upcoming prediction. In the experimental assessment, the proposed work gave enhanced performance compared to the existing methods.

Journal ArticleDOI
TL;DR: This paper proposes the integration of process mining, fuzzy multi-attribute decision making and fuzzy association rule learning to detect anomalies, and the results showed that the fuzzy association rule learning method can detect fraud at low confidence levels.
Abstract: Many corporate organizations nowadays implement enterprise resource planning (ERP) to manage their business processes. Because the processes run continuously, ERP produces a massive log of processes. Manual observation has difficulty monitoring such an enormous log, especially for detecting anomalies; a method is needed that can detect anomalies in the large log. This paper proposes the integration of process mining, fuzzy multi-attribute decision making and fuzzy association rule learning to detect anomalies. Process mining analyses the conformance between recorded event logs and standard operating procedures. Fuzzy multi-attribute decision making is applied to determine the anomaly rates. Finally, fuzzy association rule learning develops association rules that are employed to detect anomalies. The results of our experiment showed that the accuracy of the association rule learning method was 0.975 with a minimum confidence level of 0.9, and that the accuracy of the fuzzy association rule learning method was 0.925 with a minimum confidence level of 0.3. Therefore, the fuzzy association rule learning method can detect fraud at low confidence levels.
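A simplified sketch of the last stage described above: turn per-case anomaly attributes into membership indicators and mine association rules with a low minimum confidence. Crisp thresholds stand in for the paper's fuzzy memberships, and the attribute names, cut-offs, and use of mlxtend are illustrative assumptions rather than the authors' design.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Toy conformance-checking attributes per case, plus a known fraud label.
cases = pd.DataFrame({
    "skip_rate":    [0.9, 0.1, 0.8, 0.0, 0.7],
    "time_overrun": [0.8, 0.2, 0.9, 0.1, 0.6],
    "fraud":        [1, 0, 1, 0, 1],
})

# Crisp simplification of the fuzzification step: "high" membership as boolean items.
items = pd.DataFrame({
    "skip_high": cases["skip_rate"] >= 0.5,
    "time_high": cases["time_overrun"] >= 0.5,
    "fraud":     cases["fraud"].astype(bool),
})

frequent = apriori(items, min_support=0.3, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.3)  # low confidence threshold
print(rules[["antecedents", "consequents", "confidence"]])
```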

Journal ArticleDOI
TL;DR: A model that examines the relationship between the application of big data analytics and organizational performance in small and medium enterprises (SMEs) indicated that the ABDA had a positive and significant impact on OP.
Abstract: Drawing from tenets of the resource-based theory, we propose and test a model that examines the relationship between the application of big data analytics (ABDA) and organizational performance (OP) in small and medium enterprises (SMEs). Further, this study examines the mediating role of knowledge management practices (KMP) in the relation between ABDA and OP. Data were collected from respondents working in SMEs through an adapted instrument. This research study adopts the Baron–Kenny approach to test the mediation. The results indicated that ABDA had a positive and significant impact on OP. Also, KMP partially mediated the relationship between ABDA and OP in SMEs. The dataset was solely comprised of SMEs from Pakistan-administered Kashmir and may not reflect insights from other regions, which limits the generalizability of the results. The findings highlight both strategic and practical implications related to decision making in organizations for top management, particularly in developing countries. This study attempts to contribute to the literature through novel findings and recommendations. These findings will help top management during key decision-making processes and encourage practitioners who seek competitive advantage through enhanced organizational performance in SMEs.

Journal ArticleDOI
TL;DR: This research is re-implementing the basic summarization model that applies the sequence-to-sequence framework on the Arabic language, which has not witnessed the employment of this model in the text summarization before.
Abstract: Natural language processing has witnessed remarkable progress with the advent of deep learning techniques. Text summarization, along with other tasks like text translation and sentiment analysis, has used deep neural network models to enhance results. The new methods of text summarization are subject to a sequence-to-sequence framework of the encoder–decoder model, which is composed of neural networks trained jointly on both input and output. Deep neural networks take advantage of big datasets to improve their results. These networks are supported by the attention mechanism, which can deal with long texts more efficiently by identifying focus points in the text. They are also supported by the copy mechanism that allows the model to copy words from the source to the summary directly. In this research, we re-implement the basic summarization model that applies the sequence-to-sequence framework for the Arabic language, which has not previously seen this model employed for text summarization. Initially, we build an Arabic data set of summarized article headlines. This data set consists of approximately 300 thousand entries, each consisting of an article introduction and the headline corresponding to this introduction. We then apply baseline summarization models to this data set and compare the results using the ROUGE scale.

Journal ArticleDOI
TL;DR: Spark has better performance as compared to Hadoop when data sets are small, achieving up to two times speedup in WordCount workloads and up to 14 times in TeraSort workloads when default parameter values are reconfigured.
Abstract: Big Data analytics for storing, processing, and analyzing large-scale datasets has become an essential tool for the industry. The advent of distributed computing frameworks such as Hadoop and Spark offers efficient solutions to analyze vast amounts of data. Due to its application programming interface (API) availability and its performance, Spark has become very popular, even more popular than the MapReduce framework. Both of these frameworks have more than 150 parameters, and the combination of these parameters has a massive impact on cluster performance. The default system parameters help the system administrator deploy their system applications without much effort, and they can measure their specific cluster performance with factory-set parameters. However, an open question remains: can new parameter selection improve cluster performance for large datasets? In this regard, this study investigates the most impactful parameters, covering resource utilization, input splits, and shuffle, to compare the performance between Hadoop and Spark, using a cluster implemented in our laboratory. We used a trial-and-error approach for tuning these parameters based on a large number of experiments. For the comparative analysis of the frameworks, we selected two workloads: WordCount and TeraSort. The performance metrics are based on three criteria: execution time, throughput, and speedup. Our experimental results revealed that both systems’ performance heavily depends on input data size and correct parameter selection. The analysis of the results shows that Spark has better performance compared to Hadoop when data sets are small, achieving up to two times speedup in WordCount workloads and up to 14 times in TeraSort workloads when default parameter values are reconfigured.
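An illustrative PySpark snippet for the kind of parameter reconfiguration the study performs (executor resources and shuffle behaviour) before running a WordCount workload. The specific values and the HDFS path are assumptions, not the settings reported in the paper.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("wordcount-tuning")
         .config("spark.executor.memory", "4g")         # resource utilization
         .config("spark.executor.cores", "4")
         .config("spark.sql.shuffle.partitions", "64")  # shuffle behaviour
         .getOrCreate())

# Classic WordCount workload over a large text file.
lines = spark.sparkContext.textFile("hdfs:///data/large_text_corpus.txt")
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
print(counts.take(10))
spark.stop()
```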

Journal ArticleDOI
TL;DR: Results of the three models on both imbalanced and balanced datasets show that the precision, accuracy, sensitivity, recall and F-measure of the SDA-LM model on the imbalanced and balanced datasets are improved compared with the SAE-LM and SDA models.
Abstract: Flight delay is inevitable and plays an important role in both the profits and losses of airlines. An accurate estimation of flight delay is critical for airlines because the results can be applied to increase customer satisfaction and the incomes of airline agencies. There has been much research on modeling and predicting flight delays, most of which has tried to predict the delay by extracting important characteristics and the most related features. However, most of the proposed methods are not accurate enough because of the massive volume of data, dependencies, and the extreme number of parameters. This paper proposes a model for predicting flight delay based on Deep Learning (DL). DL is one of the newest methods employed in solving problems with a high level of complexity and massive amounts of data. Moreover, DL is capable of automatically extracting the important features from data. Furthermore, due to the fact that most flight delay data are noisy, a technique based on a stacked denoising autoencoder is designed and added to the proposed model. Also, the Levenberg-Marquardt algorithm is applied to find proper values for weights and biases, and finally the output is optimized to produce highly accurate results. In order to study the effect of the stacked denoising autoencoder and the LM algorithm on the model structure, two other structures are also designed. The first structure is based on an autoencoder and the LM algorithm (SAE-LM), and the second structure is based on a denoising autoencoder only (SDA). To investigate the three models, we apply the proposed model to a U.S. flight dataset, which is an imbalanced dataset. In order to create a balanced dataset, an undersampling method is used. We measured the precision, accuracy, sensitivity, recall and F-measure of the three models in both cases. The accuracy of the proposed prediction model is analyzed and compared to a previous prediction method. Results of the three models on both imbalanced and balanced datasets show that the precision, accuracy, sensitivity, recall and F-measure of the SDA-LM model on the imbalanced and balanced datasets are improved compared with the SAE-LM and SDA models. The results also show that the accuracy of the proposed model in forecasting flight delay on the imbalanced and balanced datasets is greater than that of a previous model based on RNN.
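A sketch of the denoising-autoencoder idea the model builds on: corrupt the inputs with noise and train the network to reconstruct the clean signal, so the learned features are robust to noisy flight-delay records. The random stand-in features, layer sizes, noise level, and the use of Keras with gradient-descent training (rather than the paper's Levenberg-Marquardt optimization) are assumptions.

```python
import numpy as np
import tensorflow as tf

X = np.random.default_rng(0).random((5000, 30)).astype("float32")   # stand-in for flight features
noise = np.random.default_rng(1).normal(scale=0.1, size=X.shape).astype("float32")
X_noisy = X + noise                                                  # corrupted inputs

dae = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(30,)),
    tf.keras.layers.Dense(16, activation="relu"),    # encoder
    tf.keras.layers.Dense(30, activation="linear"),  # decoder reconstructs the clean input
])
dae.compile(optimizer="adam", loss="mse")
dae.fit(X_noisy, X, epochs=10, batch_size=128, verbose=0)   # denoising objective: noisy in, clean out
print("reconstruction MSE:", float(dae.evaluate(X_noisy, X, verbose=0)))
```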

Journal ArticleDOI
TL;DR: This research paper made a comparison between machine learning and deep learning algorithms in the optimization of anomaly-based IDS by decreasing the false-positive rate and compared the results with one of the best used classifiers in traditional learning in IDS optimization.
Abstract: Anomaly-based Intrusion Detection Systems (IDS) have been a hot research topic because of their ability to detect new threats rather than only the memorized signatures of signature-based IDS, especially after the availability of advanced technologies that increase the number of hacking tools and the risk impact of an attack. The problem of any anomaly-based model is its high false-positive rate. The high false-positive rate is the reason why anomaly IDS is not commonly applied in practice, because anomaly-based models classify an unseen pattern as a threat even though it may be normal but not included in the training dataset. This type of problem is called overfitting, where the model is not able to generalize. Optimizing anomaly-based models by having a big training dataset that includes all possible normal cases may be an optimal solution but cannot be applied in practice. Although we can increase the number of training samples to include many more normal cases, we still need a model that has more ability to generalize. In this research paper, we propose applying a deep model instead of traditional models because it has more ability to generalize. Thus, we obtain fewer false positives by using big data and a deep model. We made a comparison between machine learning and deep learning algorithms in the optimization of anomaly-based IDS by decreasing the false-positive rate. We ran an experiment on the NSL-KDD benchmark and compared our results with one of the best used classifiers in traditional learning in IDS optimization. The experiment shows a 10% lower false-positive rate by using deep learning instead of traditional learning.

Journal ArticleDOI
TL;DR: This paper describes a method for learning anomaly behavior in the video by finding an attention region from spatiotemporal information, in contrast to the full-frame learning, and proposes a deep convolution network to distinguish normal and anomalous events.
Abstract: This paper describes a method for learning anomaly behavior in video by finding an attention region from spatiotemporal information, in contrast to full-frame learning. In our proposed method, a robust background subtraction (BG) for extracting motion, indicating the location of attention regions, is employed. The resulting regions are finally fed into a three-dimensional Convolutional Neural Network (3D CNN). Specifically, by taking advantage of C3D (Convolution 3-dimensional) to completely exploit the spatiotemporal relation, a deep convolution network is developed to distinguish normal and anomalous events. Our system is trained and tested against the large-scale UCF-Crime anomaly dataset to validate its effectiveness. This dataset contains 1900 long and untrimmed real-world surveillance videos, split into 950 anomaly events and 950 normal events, respectively. In total, approximately 13 million frames are learned during the training and testing phases. As shown in the experiments section, in terms of accuracy, the proposed visual attention model can obtain 99.25% accuracy. From an industrial application point of view, the extraction of this attention region can assist the security officer in focusing on the corresponding anomaly region, instead of a wider, full-frame inspection.
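A minimal sketch of the classification stage described above: a small C3D-style 3D convolutional network over short clips of the extracted attention regions, predicting normal versus anomalous. The clip shape, filter counts, and the omission of the background-subtraction step are illustrative assumptions, not the authors' exact architecture.

```python
import tensorflow as tf

clip_shape = (16, 112, 112, 3)   # frames, height, width, channels of an attention-region clip

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=clip_shape),
    tf.keras.layers.Conv3D(32, (3, 3, 3), activation="relu", padding="same"),
    tf.keras.layers.MaxPooling3D(pool_size=(1, 2, 2)),
    tf.keras.layers.Conv3D(64, (3, 3, 3), activation="relu", padding="same"),
    tf.keras.layers.MaxPooling3D(pool_size=(2, 2, 2)),
    tf.keras.layers.GlobalAveragePooling3D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # normal vs. anomalous event
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```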