
Showing papers in "Journal of Big Data in 2023"


Journal ArticleDOI
TL;DR: In this paper, a hybrid approach for analyzing sentiments is presented, comprising pre-processing, feature extraction, and sentiment classification; the model achieves an average precision, recall, and F1-score of 94.46%, 91.63%, and 92.81%, respectively.
Abstract: There is an exponential growth in textual content generation every day in today's world. In-app messaging such as Telegram and WhatsApp, social media websites such as Instagram and Facebook, e-commerce websites like Amazon, Google searches, news publishing websites, and a variety of additional sources are the possible suppliers. Every instant, all these sources produce massive amounts of text data. The interpretation of such data can help business owners analyze the social outlook of their product, brand, or service and take necessary steps. The development of a consumer review summarization model using Natural Language Processing (NLP) techniques and Long short-term memory (LSTM) to present summarized data and help businesses obtain substantial insights into their consumers' behavior and choices is the topic of this research. A hybrid approach for analyzing sentiments is presented in this paper. The process comprises pre-processing, feature extraction, and sentiment classification. Using NLP techniques, the pre-processing stage eliminates the undesirable data from input text reviews. For extracting the features effectively, a hybrid method comprising review-related features and aspect-related features has been introduced for constructing the distinctive hybrid feature vector corresponding to each review. The sentiment classification is performed using the deep learning classifier LSTM. We experimentally evaluated the proposed model using three different research datasets. The model achieves the average precision, average recall, and average F1-score of 94.46%, 91.63%, and 92.81%, respectively.
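A minimal sketch of the kind of pipeline described above, assuming a tokenized review sequence fed to an LSTM encoder that is fused with a hand-crafted hybrid feature vector; the vocabulary size, sequence length, feature dimension, and three-class output are illustrative assumptions, not the paper's actual configuration.

```python
# Sketch: LSTM sentiment classifier over tokenized reviews, with an auxiliary
# "hybrid" feature vector concatenated before the output layer.
# Hyperparameters (vocab size, lengths, units) are illustrative assumptions.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB, SEQ_LEN, HYBRID_DIM = 20000, 200, 16

tokens = layers.Input(shape=(SEQ_LEN,), name="review_tokens")
hybrid = layers.Input(shape=(HYBRID_DIM,), name="hybrid_features")  # review- and aspect-related features

x = layers.Embedding(VOCAB, 128, mask_zero=True)(tokens)
x = layers.LSTM(64)(x)                          # sequence encoder
x = layers.concatenate([x, hybrid])             # fuse text encoding with hand-crafted features
out = layers.Dense(3, activation="softmax")(x)  # negative / neutral / positive

model = Model([tokens, hybrid], out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# Tiny synthetic batch just to show the expected shapes.
X_tok = np.random.randint(1, VOCAB, size=(8, SEQ_LEN))
X_hyb = np.random.rand(8, HYBRID_DIM)
y = np.random.randint(0, 3, size=(8,))
model.fit([X_tok, X_hyb], y, epochs=1, verbose=0)
```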

6 citations


Journal ArticleDOI
TL;DR: In this article, the authors present a survey of state-of-the-art techniques for training DL models under three challenges: small datasets, imbalanced datasets, and lack of generalization.
Abstract: Abstract Data scarcity is a major challenge when training deep learning (DL) models. DL demands a large amount of data to achieve exceptional performance. Unfortunately, many applications have small or inadequate data to train DL frameworks. Usually, manual labeling is needed to provide labeled data, which typically involves human annotators with a vast background of knowledge. This annotation process is costly, time-consuming, and error-prone. Usually, every DL framework is fed by a significant amount of labeled data to automatically learn representations. Ultimately, a larger amount of data would generate a better DL model and its performance is also application dependent. This issue is the main barrier for many applications dismissing the use of DL. Having sufficient data is the first step toward any successful and trustworthy DL application. This paper presents a holistic survey on state-of-the-art techniques to deal with training DL models to overcome three challenges including small, imbalanced datasets, and lack of generalization. This survey starts by listing the learning techniques. Next, the types of DL architectures are introduced. After that, state-of-the-art solutions to address the issue of lack of training data are listed, such as Transfer Learning (TL), Self-Supervised Learning (SSL), Generative Adversarial Networks (GANs), Model Architecture (MA), Physics-Informed Neural Network (PINN), and Deep Synthetic Minority Oversampling Technique (DeepSMOTE). Then, these solutions were followed by some related tips about data acquisition needed prior to training purposes, as well as recommendations for ensuring the trustworthiness of the training dataset. The survey ends with a list of applications that suffer from data scarcity, several alternatives are proposed in order to generate more data in each application including Electromagnetic Imaging (EMI), Civil Structural Health Monitoring, Medical imaging, Meteorology, Wireless Communications, Fluid Mechanics, Microelectromechanical system, and Cybersecurity. To the best of the authors’ knowledge, this is the first review that offers a comprehensive overview on strategies to tackle data scarcity in DL.

6 citations


Journal ArticleDOI
TL;DR: In this paper, Principal Component Analysis (PCA) and Convolutional Autoencoder (CAE) methods are evaluated for credit card fraud detection; the results show that applying Random Undersampling (RUS) followed by CAE leads to the best performance.
Abstract: Abstract Training a machine learning algorithm on a class-imbalanced dataset can be a difficult task, a process that could prove even more challenging under conditions of high dimensionality. Feature extraction and data sampling are among the most popular preprocessing techniques. Feature extraction is used to derive a richer set of reduced dataset features, while data sampling is used to mitigate class imbalance. In this paper, we investigate these two preprocessing techniques, using a credit card fraud dataset and four ensemble classifiers (Random Forest, CatBoost, LightGBM, and XGBoost). Within the context of feature extraction, the Principal Component Analysis (PCA) and Convolutional Autoencoder (CAE) methods are evaluated. With regard to data sampling, the Random Undersampling (RUS), Synthetic Minority Oversampling Technique (SMOTE), and SMOTE Tomek methods are evaluated. The F1 score and Area Under the Receiver Operating Characteristic Curve (AUC) metrics serve as measures of classification performance. Our results show that the implementation of the RUS method followed by the CAE method leads to the best performance for credit card fraud detection.
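As a hedged illustration of the preprocessing combinations evaluated in the paper, the sketch below applies Random Undersampling followed by feature extraction to a synthetic imbalanced dataset and scores an ensemble classifier with F1 and AUC; PCA stands in for the convolutional autoencoder here, and the dataset and parameters are placeholders.

```python
# Sketch: Random Undersampling + feature extraction (PCA as a stand-in for the
# convolutional autoencoder) + Random Forest, scored with F1 and AUC.
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, roc_auc_score

X, y = make_classification(n_samples=20000, n_features=30, weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

pipe = Pipeline([
    ("rus", RandomUnderSampler(random_state=0)),   # balance classes on the training folds only
    ("pca", PCA(n_components=10)),                 # reduced feature representation
    ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
])
pipe.fit(X_tr, y_tr)
proba = pipe.predict_proba(X_te)[:, 1]
print("F1:", f1_score(y_te, pipe.predict(X_te)), "AUC:", roc_auc_score(y_te, proba))
```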

4 citations


Journal ArticleDOI
TL;DR: In this paper, the authors evaluate the performance of five ensemble learners on the machine learning task of Medicare fraud detection and show that AUPRC provides better insight into classification performance.
Abstract: Abstract Using the wrong metrics to gauge classification of highly imbalanced Big Data may hide important information in experimental results. However, we find that analysis of metrics for performance evaluation and what they can hide or reveal is rarely covered in related works. Therefore, we address that gap by analyzing multiple popular performance metrics on three Big Data classification tasks. To the best of our knowledge, we are the first to utilize three new Medicare insurance claims datasets which became publicly available in 2021. These datasets are all highly imbalanced. Furthermore, the datasets are comprised of completely different data. We evaluate the performance of five ensemble learners in the Machine Learning task of Medicare fraud detection. Random Undersampling (RUS) is applied to induce five class ratios. The classifiers are evaluated with both the Area Under the Receiver Operating Characteristic Curve (AUC), and Area Under the Precision Recall Curve (AUPRC) metrics. We show that AUPRC provides a better insight into classification performance. Our findings reveal that the AUC metric hides the performance impact of RUS. However, classification results in terms of AUPRC show RUS has a detrimental effect. We show that, for highly imbalanced Big Data, the AUC metric fails to capture information about precision scores and false positive counts that the AUPRC metric reveals. Our contribution is to show AUPRC is a more effective metric for evaluating the performance of classifiers when working with highly imbalanced Big Data.
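The contrast the paper draws between AUC and AUPRC can be reproduced on synthetic data: the snippet below scores the same classifier with both metrics on a highly imbalanced dataset (the data and model are illustrative, not the Medicare claims setup).

```python
# Illustrates why AUPRC can reveal what AUC hides on highly imbalanced data:
# both metrics are computed for the same scores, but AUPRC is sensitive to
# precision on the rare positive class.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score

X, y = make_classification(n_samples=50000, n_features=20, weights=[0.999, 0.001], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
print("AUC  :", round(roc_auc_score(y_te, scores), 4))
print("AUPRC:", round(average_precision_score(y_te, scores), 4))  # typically far lower on rare positives
```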

3 citations


Journal ArticleDOI
TL;DR: In this paper, the authors propose CRNet, a multimodal deep convolutional neural network for predicting customer revisits, which achieves state-of-the-art performance.
Abstract: Since mobile food delivery services have become one of the essential issues for the restaurant industry, predicting customer revisits is highlighted as a significant academic and research topic. Considering that the use of multimodal datasets has gained notable attention from several scholars in addressing multiple industrial issues in our society, we introduce CRNet, a multimodal deep convolutional neural network for predicting customer revisits. We evaluated our approach using two datasets [a customer repurchase dataset (CRD) and a mobile food delivery revisit dataset (MFDRD)] and two state-of-the-art multimodal deep learning models. The results showed that CRNet obtained accuracies of 0.9575 (CRD) and 0.9436 (MFDRD) and F1-scores of 0.9730 (CRD) and 0.9509 (MFDRD), thus achieving higher performance than current state-of-the-art multimodal frameworks (accuracy: 0.7417-0.9012; F1-score: 0.7461-0.9378). Future research should aim to address other resources that can enhance the proposed framework (e.g., metadata information). The online version contains supplementary material available at 10.1186/s40537-022-00674-4.

3 citations


Journal ArticleDOI
TL;DR: In this paper, the authors use the synthetic minority over-sampling technique (SMOTE) and generative adversarial networks (GAN) to balance the class variable and examine the resulting improvement in turnover-intention prediction accuracy.
Abstract: This study aims to improve the accuracy of forecasting the turnover intention of new college graduates by solving the imbalanced data problem. For this purpose, data from the Korea Employment Information Service's Job Mobility Survey (Graduates Occupations Mobility Survey: GOMS) for college graduates were used. The data include various items such as turnover intention, personal characteristics, and job characteristics of new college graduates, and the class ratio of turnover intention is imbalanced. To solve the imbalanced data problem, the synthetic minority over-sampling technique (SMOTE) and generative adversarial networks (GAN) were used to balance the class variable and examine the improvement in turnover intention prediction accuracy. After deriving the factors affecting turnover intention by referring to previous studies, a turnover intention prediction model was constructed, and the model's prediction accuracy was analyzed for each dataset. As a result of the analysis, the highest predictive accuracy was found for data balanced through generative adversarial networks, rather than for the class-imbalanced original data or the data balanced through SMOTE. The academic implication of this study is, first, that the diversity of data sampling methods was broadened by extending and applying GANs, which are widely used for sampling unstructured data such as images, to structured data in business administration fields such as this study. Second, two refining processes were performed on data generated using generative adversarial networks to suggest a method for refining only the data corresponding to the minority class. The practical implication of this study is that it suggests a plan for predicting the turnover intention of new college graduates early through a predictive model built with public data and machine learning.
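A minimal sketch of the SMOTE balancing step, one of the two strategies compared in the study; the synthetic data is only a stand-in for the GOMS survey variables.

```python
# Balance an imbalanced class variable with SMOTE and inspect the class counts.
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5000, n_features=12, weights=[0.85, 0.15], random_state=42)
print("before:", Counter(y))
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X, y)
print("after :", Counter(y_bal))
```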

3 citations


Journal ArticleDOI
TL;DR: In this paper, a general governance and sustainable architecture for distributed computing continuum systems (DCCS) is proposed, modeled on the human body's self-healing process. The model has three stages: first, it analyzes system data to acquire knowledge; second, it leverages that knowledge to monitor and predict future conditions; and third, it takes further action to autonomously resolve any issue or alert administrators.
Abstract: Abstract Distributed computing continuum systems (DCCS) make use of a vast number of computing devices to process data generated by edge devices such as the Internet of Things and sensor nodes. Besides performing computations, these devices also produce data including, for example, event logs, configuration files, network management information. When these data are analyzed, we can learn more about the devices, such as their capabilities, processing efficiency, resource usage, and failure prediction. However, these data are available in different forms and have different attributes due to the highly heterogeneous nature of DCCS. The diversity of data poses various challenges which we discuss by relating them to big data, so that we can utilize the advantages of big data analytical tools. We enumerate several existing tools that can perform the monitoring task and also summarize their characteristics. Further, we provide a general governance and sustainable architecture for DCCS, which reflects the human body’s self-healing model. The proposed model has three stages: first, it analyzes system data to acquire knowledge; second, it can leverage the knowledge to monitor and predict future conditions; and third, it takes further actions to autonomously solve any issue or to alert administrators. Thus, the DCCS model is designed to minimize the system’s downtime while optimizing resource usage. A small set of data is used to illustrate the monitoring and prediction of the performance of a system through Bayesian network structure learning. Finally, we discuss the limitations of the governance and sustainability model, and we provide possible solutions to overcome them and make the system more efficient.

3 citations


Journal ArticleDOI
TL;DR: In this article, the authors study the positive/negative sentiment surrounding the newly identified, but not yet well-defined, metaverse concept that is already rapidly reshaping the digital landscape.
Abstract: The metaverse has become one of the most popular concepts of recent times. Companies and entrepreneurs are fiercely competing to invest and take part in this virtual world. Millions of people globally are anticipated to spend much of their time in the metaverse, regardless of their age, gender, ethnicity, or culture. There are few comprehensive studies on the positive/negative sentiment and effect of the newly identified, but not well defined, metaverse concept that is already fast evolving the digital landscape. Thereby, this study aimed to better understand the metaverse concept, by, firstly, identifying the positive and negative sentiment characteristics and, secondly, by revealing the associations between the metaverse concept and other related concepts. To do so, this study used Natural Language Processing (NLP) methods, specifically Artificial Intelligence (AI) with computational qualitative analysis. The data comprised metaverse articles from 2021 to 2022 published on The Guardian website, a key global mainstream media outlet. To perform thematic content analysis of the qualitative data, this research used the Leximancer software, and the The Natural Language Toolkit (NLTK) from NLP libraries were used to identify sentiment. Further, an AI-based Monkeylearn API was used to make sectoral classifications of the main topics that emerged in the Leximancer analysis. The key themes which emerged in the Leximancer analysis, included "metaverse", "Facebook", "games" and "platforms". The sentiment analysis revealed that of all articles published in the period of 2021-2022 about the metaverse, 61% (n = 622) were positive, 30% (n = 311) were negative, and 9% (n = 90) were neutral. Positive discourses about the metaverse were found to concern key innovations that the virtual experiences brought to users and companies with the support of the technological infrastructure of blockchain, algorithms, NFTs, led by the gaming world. Negative discourse was found to evidence various problems (misinformation, harmful content, algorithms, data, and equipment) that occur during the use of Facebook and other social media platforms, and that individuals encountered harm in the metaverse or that the metaverse produces new problems. Monkeylearn findings revealed "marketing/advertising/PR" role, "Recreational" business, "Science & Technology" events as the key content topics. This study's contribution is twofold: first, it showcases a novel way to triangulate qualitative data analysis of large unstructured textual data as a method in exploring the metaverse concept; and second, the study reveals the characteristics of the metaverse as a concept, as well as its association with other related concepts. Given that the topic of the metaverse is new, this is the first study, to our knowledge, to do both.
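A small sketch of NLTK-based sentiment labeling in the spirit of the article's analysis, using the VADER analyzer and the conventional ±0.05 compound-score thresholds; the example texts are invented.

```python
# Sentence-level sentiment scoring with NLTK's VADER analyzer.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

articles = [
    "The metaverse opens exciting opportunities for creators and brands.",
    "Critics warn the metaverse may amplify misinformation and harmful content.",
]
for text in articles:
    scores = sia.polarity_scores(text)  # neg / neu / pos / compound
    label = ("positive" if scores["compound"] > 0.05
             else "negative" if scores["compound"] < -0.05 else "neutral")
    print(label, scores)
```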

3 citations


Journal ArticleDOI
TL;DR: A systematic literature review of machine learning algorithms for forex market forecasting is presented in this paper, covering 60 papers published between 2010 and 2021; the most commonly used machine learning methods are LSTM and artificial neural networks.
Abstract: Background When you make a forex transaction, you sell one currency and buy another. If the currency you buy increases against the currency you sell, you profit, and you do this through a broker as a retail trader on the internet using a platform known as MetaTrader. Only 2% of retail traders can successfully predict currency movement in the forex market, making it one of the most challenging tasks. Machine learning and its derivatives or hybrid models are becoming increasingly popular in market forecasting, which is a rapidly developing field. Objective While the research community has looked into the methodologies used by researchers to forecast the forex market, there is still a need to look into how machine learning and artificial intelligence approaches have been used to predict the forex market and whether there are any areas that can be improved to allow for better predictions. Our objective is to give an overview of machine learning models and their application in the FX market. Method This study provides a Systematic Literature Review (SLR) of machine learning algorithms for FX market forecasting. Our research looks at publications that were published between 2010 and 2021. A total of 60 papers are taken into consideration. We looked at them from two angles: (i) the design of the evaluation techniques, and (ii) a meta-analysis of the performance of machine learning models utilizing evaluation metrics thus far. Results The results of the analysis suggest that the most commonly utilized assessment metrics are MAE, RMSE, MAPE, and MSE, with EURUSD being the most traded pair on the planet. LSTM and Artificial Neural Networks are the most commonly used machine learning algorithms for FX market prediction. The findings also point to many unresolved concerns and difficulties that the scientific community should address in the future. Conclusion Based on our findings, we believe that machine learning approaches in the area of currency prediction still have room for development. Researchers interested in creating more advanced strategies might use the open concerns raised in this work as input.
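Since the review identifies MAE, RMSE, MAPE, and MSE as the most common evaluation metrics, the helper below computes all four for a forecast; the sample EURUSD-style quotes are placeholders.

```python
# Compute the four most common forecast-evaluation metrics found by the review.
import numpy as np

def regression_metrics(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    err = y_true - y_pred
    mae = np.mean(np.abs(err))
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    mape = np.mean(np.abs(err / y_true)) * 100  # assumes no zero targets
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape}

print(regression_metrics([1.0712, 1.0735, 1.0701], [1.0700, 1.0740, 1.0690]))  # EURUSD-style quotes
```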

3 citations


Journal ArticleDOI
TL;DR: In this paper, a new method for selecting effective features for network intrusion detection is presented, combining the concept of fuzzy numbers with scoring based on correlation feature selection.
Abstract: Due to the increasing growth of the Internet and its widespread application, the number of attacks on the network has also increased. Therefore, maintaining network security and using intrusion detection systems is of critical importance. The connections between devices lead to a large amount of data being generated and saved, and the era of "big data" emerges over time. This paper presents a new method for selecting effective features for network intrusion detection based on the concept of fuzzy numbers and scoring methods derived from correlation feature selection. The goal is to reduce data size by combining fuzzy numbers with correlation-based feature selection for intrusion detection systems. In this method, to eliminate inefficient features and reduce data dimensions, the number of features is defined as a fuzzy number, and the heuristic function of the correlation-based feature selection algorithm is expressed as a triangular fuzzy number membership function. To evaluate the proposed method, it is compared to previous intrusion detection methods. The results show that the proposed method selects fewer features than the conventional methods while achieving a higher detection rate. The proposed method is compared with the correlation-based feature selection method on two datasets and is evaluated and validated on the KDD Cup, NSL-KDD, and CICIDS datasets. The achieved accuracy is 99.9%, compared with 96.01% for the CFS method.
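A hedged sketch of the two ingredients the method combines: a triangular fuzzy membership function over the number of selected features and the standard correlation-based feature selection (CFS) merit heuristic; the parameters and example values are assumptions.

```python
# Triangular fuzzy membership over the feature count, plus the CFS merit heuristic.
import numpy as np

def triangular_membership(x, a, b, c):
    """Degree to which feature-count x belongs to the triangular fuzzy number (a, b, c)."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def cfs_merit(k, avg_feature_class_corr, avg_feature_feature_corr):
    """Correlation-based feature selection merit for a subset of k features."""
    return (k * avg_feature_class_corr) / np.sqrt(k + k * (k - 1) * avg_feature_feature_corr)

print(triangular_membership(12, a=5, b=15, c=25))  # preference for roughly 15 features
print(cfs_merit(12, 0.42, 0.18))                   # higher merit = better subset
```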

2 citations


Journal ArticleDOI
TL;DR: In this paper, the authors propose a generalized detection approach that identifies features for application-layer DoS attacks without being specific to a single slow DoS attack, combining four application-layer attack datasets: Slow Read, HTTP POST, Slowloris, and Apache Range Header.
Abstract: Abstract With the massive resources and strategies accessible to attackers, countering Denial of Service (DoS) attacks is getting increasingly difficult. One of these techniques is application-layer DoS. Due to these challenges, network security has become increasingly more challenging to ensure. Hypertext Transfer Protocol (HTTP), Domain Name Service (DNS), Simple Mail Transfer Protocol (SMTP), and other application protocols have had increased attacks over the past several years. It is common for application-layer attacks to concentrate on these protocols because attackers can exploit some weaknesses. Flood and “low and slow” attacks are examples of application-layer attacks. They target weaknesses in HTTP, the most extensively used application-layer protocol on the Internet. Our experiment proposes a generalized detection approach to identify features for application-layer DoS attacks that is not specific to a single slow DoS attack. We combine four application-layer DoS attack datasets: Slow Read, HTTP POST, Slowloris, and Apache Range Header. We perform a feature-scaling technique that applies a normalization filter to the combined dataset. We perform a feature extraction technique, Principal Component Analysis (PCA), on the combined dataset to reduce dimensionality. We examine ways to enhance machine learning techniques for detecting slow application-layer DoS attacks that employ these methodologies. The machine learners effectively identify multiple slow DoS attacks, according to our findings. The experiment shows that classifiers are good predictors when combined with our selected Netflow characteristics and feature selection techniques.

Journal ArticleDOI
TL;DR: In this paper, the authors address the detection of crowds at points of interest (POIs) by using a territory grid analysis that categorizes POIs by the services available at each location and by comparing data gathered from a community passive Wi-Fi infrastructure against mobile cellular tower association data from telecom companies.
Abstract: Sensing passersby and detecting crowded locations has been a growing area of research and development over the last decades. The COVID-19 pandemic compelled authorities and public and private institutions to monitor access to and occupancy of crowded spaces. This work addresses the detection of crowds at points of interest (POIs) by using a territory grid analysis that categorizes POIs by the services available at each location and by comparing data gathered from a community passive Wi-Fi infrastructure against mobile cellular tower association data from telecom companies. In the Madeira Islands (Portugal), we used data from the telecom provider NOS over a timespan of four months as ground truth and found a strong correlation with the sparse passive Wi-Fi data. An official regional mobile application shows the occupancy data to end users based on the territory categorization and the passive Wi-Fi infrastructure at POIs. The occupancy data show historical hourly trends for each location as well as real-time occupancy, helping visitors and locals better plan their commutes to avoid crowded spaces.
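The core comparison (passive Wi-Fi counts versus cellular tower association counts for the same POI) can be expressed as a simple correlation check; the hourly counts below are invented for illustration.

```python
# Correlate hourly passive Wi-Fi detections with telecom tower association counts.
import numpy as np
from scipy.stats import pearsonr

wifi_hourly  = np.array([12, 18, 25, 40, 65, 80, 72, 50, 30, 15])            # passive Wi-Fi detections
tower_hourly = np.array([90, 120, 160, 240, 380, 470, 430, 300, 190, 100])   # telecom associations

r, p = pearsonr(wifi_hourly, tower_hourly)
print(f"Pearson r = {r:.3f}, p = {p:.4f}")  # strong correlation supports Wi-Fi as a proxy for occupancy
```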

Journal ArticleDOI
TL;DR: This article presents a taxonomy of open problems in computing for numerically and logically intensive problems across the disciplines that must synergize for the best performance of simulation-based feasibility studies on nature-oriented engineering in general and civil engineering in particular.
Abstract: Abstract This article presents a taxonomy and represents a repository of open problems in computing for numerically and logically intensive problems in a number of disciplines that have to synergize for the best performance of simulation-based feasibility studies on nature-oriented engineering in general and civil engineering in particular. Topics include but are not limited to: Nature-based construction, genomics supporting nature-based construction, earthquake engineering, and other types of geophysical disaster prevention activities, as well as the studies of processes and materials of interest for the above. In all these fields, problems are discussed that generate huge amounts of Big Data and are characterized with mathematically highly complex Iterative Algorithms. In the domain of applications, it has been stressed that problems could be made less computationally demanding if the number of computing iterations is made smaller (with the help of Artificial Intelligence or Conditional Algorithms), or if each computing iteration is made shorter in time (with the help of Data Filtration and Data Quantization). In the domain of computing, it has been stressed that computing could be made more powerful if the implementation technology is changed (Si, GaAs, etc.…), or if the computing paradigm is changed (Control Flow, Data Flow, etc.…).

Journal ArticleDOI
TL;DR: In this article, the quality of structured and unstructured data used for machine learning model construction in head and neck cancer is surveyed. Quality assessments were performed for only 14.2% of structured datasets and 11.3% of unstructured datasets before model construction, with outlier detection and a lack of representative outcome classes being common limitations of structured and unstructured datasets, respectively.
Abstract: Abstract Machine learning models have been increasingly considered to model head and neck cancer outcomes for improved screening, diagnosis, treatment, and prognostication of the disease. As the concept of data-centric artificial intelligence is still incipient in healthcare systems, little is known about the data quality of the models proposed for clinical utility. This is important as it supports the generalizability of the models and data standardization. Therefore, this study overviews the quality of structured and unstructured data used for machine learning model construction in head and neck cancer. Relevant studies reporting on the use of machine learning models based on structured and unstructured custom datasets between January 2016 and June 2022 were sourced from PubMed, EMBASE, Scopus, and Web of Science electronic databases. Prediction model Risk of Bias Assessment (PROBAST) tool was used to assess the quality of individual studies before comprehensive data quality parameters were assessed according to the type of dataset used for model construction. A total of 159 studies were included in the review; 106 utilized structured datasets while 53 utilized unstructured datasets. Data quality assessments were deliberately performed for 14.2% of structured datasets and 11.3% of unstructured datasets before model construction. Class imbalance and data fairness were the most common limitations in data quality for both types of datasets while outlier detection and lack of representative outcome classes were common in structured and unstructured datasets respectively. Furthermore, this review found that class imbalance reduced the discriminatory performance for models based on structured datasets while higher image resolution and good class overlap resulted in better model performance using unstructured datasets during internal validation. Overall, data quality was infrequently assessed before the construction of ML models in head and neck cancer irrespective of the use of structured or unstructured datasets. To improve model generalizability, the assessments discussed in this study should be introduced during model construction to achieve data-centric intelligent systems for head and neck cancer management.

Journal ArticleDOI
TL;DR: In this paper, the authors propose KAGN, an end-to-end attention and graph-based neural network model that incorporates external knowledge from knowledge graphs to detect rumors.
Abstract: Abstract Rumor posts have received substantial attention with the rapid development of online and social media platforms. The automatic detection of rumor from posts has emerged as a major concern for the general public, the government, and social media platforms. Most existing methods focus on the linguistic and semantic aspects of posts content, while ignoring knowledge entities and concepts hidden within the article which facilitate rumor detection. To address these limitations, in this paper, we propose a novel end-to-end attention and graph-based neural network model (KAGN), which incorporates external knowledge from the knowledge graphs to detect rumor. Specifically, given the post's sparse and ambiguous semantics, we identify entity mentions in the post’s content and link them to entities and concepts in the knowledge graphs, which serve as complementary semantic information for the post text. To effectively inject external knowledge into textual representations, we develop a knowledge-aware attention mechanism to fuse local knowledge. Additionally, we construct a graph consisting of posts texts, entities, and concepts, which is fed to graph convolutional networks to explore long-range knowledge through graph structure. Our proposed model can therefore detect rumor by combining semantic-level and knowledge-level representations of posts. Extensive experiments on four publicly available real-world datasets show that KAGN outperforms or is comparable to other state-of-the-art methods, and also validate the effectiveness of knowledge.

Journal ArticleDOI
TL;DR: In this article, the authors propose a novel lightweight wearable aid for visually impaired and blind people, named the Blind's Apron after its appearance; its features include obstacle detection, uneven surface detection, slope and downward step detection, pothole detection, and hollow object detection.
Abstract: The research and innovations in the field of wearable auxiliary devices for visually impaired and blind people are playing a vital role in improving their quality of life. However, in spite of the promising research outcomes, existing wearable aids have several weaknesses, such as excessive weight, a limited number of features, and high cost. The main objective of this manuscript is to provide the detailed design of a novel lightweight wearable aid with a higher number of features for visually impaired and blind people. The proposed research aims to design a cognitive assistant that guides blind people while walking by detecting the environment around them. The framework includes a Multi-Sensor Fused Navigation system comprising sensor-based, vision-based, and cognitive (intelligent/smart) applications. The visual features of the design include obstacle detection, uneven surface detection, slope and downward step detection, pothole detection, and hollow object detection, along with location tracking, a walking guide, image capturing, and video recording. The prototype is named the Blind's Apron based on its appearance. The invention focuses on reduced size (quite handy), light weight (comfortable to wear), a higher number of detection features, and minimal user intervention (high-end operations such as switching on and off). All user interactions are friendly, and the device is affordable to everyone. The results obtained in this research lead to a high-end technical intervention with ease of use. Finally, the performance of the proposed cognitive assistant is tested with a real-time user study. The feedback and corresponding results establish the effectiveness of the proposed invention: a lightweight, feature-enhanced device with easily understandable instructions.

Journal ArticleDOI
TL;DR: In this article, a multi-criteria recommender system is proposed to analyze the impact of contextual segments on the overall rating based on trip type and hotel class, with an item-item collaborative filtering approach introduced to fill in missing context values in the dataset.
Abstract: With the growth of sites sharing travel details, an enormous number of reviews are posted every day. In order to recognize potential target customers quickly and effectively, hotels need to establish a customer recommender system. The data adopted in this study were provided by TripAdvisor, which permits customers to rate a hotel on the basis of six criteria: Service, Sleep Quality, Value, Location, Cleanliness, and Room. This study proposes a multi-criteria recommender system to analyse the impact of contextual segments on the overall rating based on trip type and hotel class. In this research, we introduce an item-item collaborative filtering approach, in which the adjusted cosine similarity measure is applied to fill in missing context values in the dataset. For the selection of significant contexts, backward elimination with a multiple regression algorithm is introduced. The multicollinearity among predictors is examined on the basis of the Variance Inflation Factor (VIF). In the experimental scenario, the results are reported by hotel class and trip type. The performance of the multiple regression model is evaluated by statistical measures such as R-squared, MAE, MSE, and RMSE. In addition, an ANOVA study is conducted for different trip types under the 2-, 3-, 4-, and 5-star hotel classes.
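A small sketch of the adjusted cosine similarity used in item-item collaborative filtering, computed on a toy user-criteria rating matrix with missing values; the matrix and indices are illustrative.

```python
# Adjusted cosine similarity: ratings are centred on each user's mean before
# computing cosine similarity between two items/criteria.
import numpy as np

def adjusted_cosine(ratings, i, j):
    """ratings: users x items matrix with np.nan for missing values."""
    user_means = np.nanmean(ratings, axis=1, keepdims=True)
    centred = ratings - user_means
    mask = ~np.isnan(ratings[:, i]) & ~np.isnan(ratings[:, j])  # co-rating users only
    a, b = centred[mask, i], centred[mask, j]
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

R = np.array([[4, 5, np.nan],
              [3, 4, 2],
              [5, np.nan, 4],
              [2, 3, 1]], dtype=float)
print(adjusted_cosine(R, 0, 1))  # similarity between criteria 0 and 1
```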

Journal ArticleDOI
TL;DR: In this paper, a deep weakly supervised reinforcement learning-based approach is proposed to identify anomalies in business processes by leveraging limited labeled anomaly data: it uses a small collection of labeled anomalous data while exploring a large set of unlabeled data to find new classes of anomalies.
Abstract: Abstract The detection of anomalous behavior in business process data is a crucial task for preventing failures that may jeopardize the performance of any organization. Supervised learning techniques are impracticable because of the difficulties of gathering huge amounts of labeled business process anomaly data. For this reason, unsupervised learning techniques and semi-supervised learning approaches trained on entirely labeled normal data have dominated this domain for a long time. However, these methods do not work well because of the absence of prior knowledge of true anomalies. In this study, we propose a deep weakly supervised reinforcement learning-based approach to identify anomalies in business processes by leveraging limited labeled anomaly data. The proposed approach is intended to use a small collection of labeled anomalous data while exploring a huge set of unlabeled data to find new classes of anomalies that are outside the scope of the labeled anomalous data. We created a unique reward function that combined the supervisory signal supplied by a variational autoencoder trained on unlabeled data with the supervisory signal provided by the environment’s reward. To further reduce data deficiency, we introduced a sampling method to allow the effective exploration of the unlabeled data and to address the imbalanced data problem, which is a common problem in the anomaly detection field. This approach depends on the proximity between the data samples in the latent space of the variational autoencoder. Furthermore, to efficiently model the sequential nature of business process data and to handle the long-term dependences, we used a long short-term memory network combined with a self-attention mechanism to develop the agent of our reinforcement learning model. Multiple scenarios were used to test the proposed approach on real-world and synthetic datasets. The findings revealed that the proposed approach outperformed five competing approaches by efficiently using the few available anomalous examples.

Journal ArticleDOI
TL;DR: In this article, the applicability of Artificial Neural Network (ANN) and Adaptive Neuro-Fuzzy Inference System (ANFIS) models in predicting long-term monthly precipitation is investigated using geographical and periodicity component (longitude, latitude, and altitude) data collected from 2011 to 2021.
Abstract: Abstract Global climate change is affecting water resources and other aspects of life in many countries. Rainfall is the most significant climate element affecting the livelihood and well-being of the majority of Ethiopians. Rainfall variability has a great impact on agricultural production, water supply, transportation, the environment, and urban planning. Because all agricultural activities and subsequent national crop production hinge on the amount and distribution of rainfall, accurate monthly and seasonal predictions of this rainfall are vital for agricultural planning. Rainfall prediction is also useful for governmental, non-governmental, and private agencies in making long-term decisions and planning in numerous areas such as farming, early warning of potential hazards, drought mitigation, disaster prevention, and insurance policy. Artificial Intelligence (AI) has been widely used in almost every area, and rainfall prediction is one of them. In this study, we attempt to investigate the use of AI-based models to predict monthly rainfall at 92 Ethiopian meteorological stations. The applicability of Artificial Neural Networks (ANNs) and Adaptive Neuro-Fuzzy Inference System (ANFIS) models in predicting long-term monthly precipitation was investigated using geographical and periodicity component (longitude, latitude, and altitude) data collected from 2011 to 2021. The experimental results reveal that the ANFIS model outperforms the ANN model in all assessment criteria across all testing stations. The Nash–Sutcliffe efficiency coefficients were 0.995 for ANFIS and 0.935 for ANN over testing stations.
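The Nash-Sutcliffe efficiency used to compare the ANN and ANFIS models can be implemented directly; the observed and simulated rainfall values below are placeholders.

```python
# Nash-Sutcliffe efficiency: 1.0 is a perfect fit, 0.0 is no better than the mean.
import numpy as np

def nash_sutcliffe(observed, simulated):
    observed, simulated = np.asarray(observed, float), np.asarray(simulated, float)
    return 1.0 - np.sum((observed - simulated) ** 2) / np.sum((observed - np.mean(observed)) ** 2)

obs = [120.0, 85.0, 40.0, 10.0, 5.0, 60.0]  # monthly rainfall (mm), illustrative
sim = [115.0, 90.0, 35.0, 12.0, 8.0, 55.0]
print(round(nash_sutcliffe(obs, sim), 3))
```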

Journal ArticleDOI
TL;DR: In this article, the authors propose an FPGA processing engine that overlaps, hides, and customises all data transfers so that the FPGA accelerator is fully utilised.
Abstract: Abstract Processing large-scale graphs is challenging due to the nature of the computation that causes irregular memory access patterns. Managing such irregular accesses may cause significant performance degradation on both CPUs and GPUs. Thus, recent research trends propose graph processing acceleration with Field-Programmable Gate Arrays (FPGA). FPGAs are programmable hardware devices that can be fully customised to perform specific tasks in a highly parallel and efficient manner. However, FPGAs have a limited amount of on-chip memory that cannot fit the entire graph. Due to the limited device memory size, data needs to be repeatedly transferred to and from the FPGA on-chip memory, which makes data transfer time dominate over the computation time. A possible way to overcome the FPGA accelerators’ resource limitation is to engage a multi-FPGA distributed architecture and use an efficient partitioning scheme. Such a scheme aims to increase data locality and minimise communication between different partitions. This work proposes an FPGA processing engine that overlaps, hides and customises all data transfers so that the FPGA accelerator is fully utilised. This engine is integrated into a framework for using FPGA clusters and is able to use an offline partitioning method to facilitate the distribution of large-scale graphs. The proposed framework uses Hadoop at a higher level to map a graph to the underlying hardware platform. The higher layer of computation is responsible for gathering the blocks of data that have been pre-processed and stored on the host’s file system and distribute to a lower layer of computation made of FPGAs. We show how graph partitioning combined with an FPGA architecture will lead to high performance, even when the graph has Millions of vertices and Billions of edges. In the case of the PageRank algorithm, widely used for ranking the importance of nodes in a graph, compared to state-of-the-art CPU and GPU solutions, our implementation is the fastest, achieving a speedup of 13 compared to 8 and 3 respectively. Moreover, in the case of the large-scale graphs, the GPU solution fails due to memory limitations while the CPU solution achieves a speedup of 12 compared to the 26x achieved by our FPGA solution. Other state-of-the-art FPGA solutions are 28 times slower than our proposed solution. When the size of a graph limits the performance of a single FPGA device, our performance model shows that using multi-FPGAs in a distributed system can further improve the performance by about 12x. This highlights our implementation efficiency for large datasets not fitting in the on-chip memory of a hardware device.
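For reference, the PageRank algorithm the paper accelerates can be written in a few lines of NumPy; this plain version is only practical for small graphs, unlike the FPGA implementation targeting millions of vertices and billions of edges.

```python
# Reference PageRank on a dense adjacency matrix; useful for sanity checks on tiny graphs.
import numpy as np

def pagerank(adj, d=0.85, tol=1e-9, max_iter=100):
    """adj[i][j] = 1 if there is an edge i -> j."""
    n = adj.shape[0]
    out_deg = adj.sum(axis=1)
    ranks = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new = np.full(n, (1.0 - d) / n)
        for i in range(n):
            if out_deg[i] > 0:
                new += d * ranks[i] * adj[i] / out_deg[i]
            else:                          # dangling node: spread its rank uniformly
                new += d * ranks[i] / n
        if np.abs(new - ranks).sum() < tol:
            break
        ranks = new
    return ranks

A = np.array([[0, 1, 1], [0, 0, 1], [1, 0, 0]])
print(pagerank(A))
```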

Journal ArticleDOI
TL;DR: In this paper, a simple regularization term is introduced to manage the number of over-predicted/under-predicted instances in a regression model that minimizes the distance between the actual and predicted values.
Abstract: In Machine Learning, prediction quality is usually measured using different techniques and evaluation methods. In the regression models, the goal is to minimize the distance between the actual and predicted value. This error evaluation technique lacks a detailed evaluation of the type of errors that occur on specific data. This paper will introduce a simple regularization term to manage the number of over-predicted/under-predicted instances in a regression model.

Journal ArticleDOI
TL;DR: In this article, the authors investigate the feasibility of liquid biopsy-based gene signatures in predicting the prognosis of lower-grade glioma (LGG) patients, as well as the benefits of immunotherapy.
Abstract: Abstract Background Recent studies have shown that immunotherapies, including peptide vaccines, remain promising strategies for patients with lower grade glioma (LGG); however new biomarkers need to be developed to identify patients who may benefit from therapy. We aimed to investigate the feasibility of liquid biopsy-based gene signatures in predicting the prognosis of LGG patients, as well as the benefits of immunotherapy. Methods We evaluated the association between circulating immune cells and treatment response by analyzing peripheral blood mononuclear cell (PBMC) samples from LGG patients receiving peptide vaccine immunotherapy, identified response-related genes (RRGs), and constructed RRG-related Response Score. In addition, RRG-related RiskScore was constructed in LGG tumor samples based on RRGs; association analysis for RiskScore and characteristics of TME as well as patient prognosis were performed in two LGG tumor datasets. The predictive power of RiskScore for immunotherapy benefits was analyzed in an anti-PD-1 treatment cohort. Results This study demonstrated the importance of circulating immune cells, including monocytes, in the immunotherapeutic response and prognosis of patients with LGG. Overall, 43 significant RRGs were identified, and three clusters with different characteristics were identified in PBMC samples based on RRGs. The constructed RRG-related Response Score could identify patients who produced a complete response to peptide vaccine immunotherapy and could predict prognosis. Additionally, three subtypes were identified in LGG tumors based on RRGs, with subtype 2 being an immune “hot” phenotype suitable for immune checkpoint therapy. The constructed RRG-related RiskScore was significantly positively correlated with the level of tumor immune cell infiltration. Patients with high RiskScore had a worse prognosis and were more likely to respond to immune checkpoint therapy. The therapeutic advantage and clinical benefits of patients with a high RiskScore were confirmed in an anti-PD-1 treatment cohort. Conclusion This study confirmed the potential of liquid biopsy for individualized treatment selection in LGG patients and determined the feasibility of circulating immune cells as biomarkers for LGG. Scoring systems based on RRGs can predict the benefits of immunotherapy and prognosis in patients with LGG. This work would help to increase our understanding of the clinical significance of liquid biopsy and more effectively guide individualized immunotherapy strategies.

Journal ArticleDOI
TL;DR: In this paper, an adaptive approach that outperforms other numerical methods in classification problems, developed using the class center-based Firefly algorithm with attribute correlations incorporated into the imputation process (C3FA), is extended to categorical data via smoothing target encoding.
Abstract: One of the most common causes of incompleteness is missing data, which occurs when no data value is stored for a variable in an observation. An adaptive approach that outperforms other numerical methods in classification problems was developed using the class center-based Firefly algorithm, incorporating attribute correlations into the imputation process (C3FA). However, this model has not been tested on categorical data, which is essential in the preprocessing stage. Encoding is used to convert text or Boolean values in categorical data into numeric parameters, and the target encoding method is often utilized. This method uses target variable information to encode categorical data, and it carries the risk of overfitting and inaccuracy within infrequent categories. This study aims to use the smoothing target encoding (STE) method to perform the imputation process by combining C3FA with the standard deviation (STD), and to compare it with several imputation methods. The results on the Tic-Tac-Toe dataset showed that the proposed method (C3FA-STD) produced AUC, CA, F1-score, precision, and recall values of 0.939, 0.882, 0.881, 0.881, and 0.882, respectively, based on evaluation using the kNN classifier.
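A minimal sketch of smoothing target encoding for a categorical column: each category's encoding is its target mean shrunk toward the global mean; the smoothing weight m and the toy data are assumptions.

```python
# Smoothed target encoding: blend each category's target mean with the global mean.
import pandas as pd

def smoothed_target_encode(df, cat_col, target_col, m=10.0):
    global_mean = df[target_col].mean()
    stats = df.groupby(cat_col)[target_col].agg(["mean", "count"])
    encoding = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
    return df[cat_col].map(encoding)

df = pd.DataFrame({"square": ["x", "o", "x", "b", "o", "x"],
                   "positive": [1, 0, 1, 0, 1, 1]})
df["square_enc"] = smoothed_target_encode(df, "square", "positive")
print(df)
```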

Journal ArticleDOI
TL;DR: In this paper, a variable range-based guard rail modification is proposed that benefits the convergence rate of data elements while simultaneously providing increased confidence in the plausibility of the imputations.
Abstract: Abstract This paper presents a stochastic imputation approach for large datasets using a correlation selection methodology when preferred commercial packages struggle to iterate due to numerical problems. A variable range-based guard rail modification is proposed that benefits the convergence rate of data elements while simultaneously providing increased confidence in the plausibility of the imputations. A large country conflict dataset motivates the search to impute missing values well over a common threshold of 20% missingness. The Multicollinearity Applied Stepwise Stochastic imputation methodology (MASS-impute) capitalizes on correlation between variables within the dataset and uses model residuals to estimate unknown values. Examination of the methodology provides insight toward choosing linear or nonlinear modeling terms. Tailorable tolerances exploit residual information to fit each data element. The methodology evaluation includes observing computation time, model fit, and the comparison of known values to replaced values created through imputation. Overall, the methodology provides useable and defendable results in imputing missing elements of a country conflict dataset.

Journal ArticleDOI
TL;DR: In this paper, a transfer learning approach to the crop classification problem is presented, based on time series of Sentinel-2 images labeled for two regions: Brittany (France) and Vojvodina (Serbia).
Abstract: This paper presents a transfer learning approach to the crop classification problem based on time series of images from the Sentinel-2 dataset labeled for two regions: Brittany (France) and Vojvodina (Serbia). During preprocessing, cloudy images are removed from the input data, the time series are interpolated over the time dimension, and additional remote sensing indices are calculated. We chose the TransformerEncoder as the base model for knowledge transfer from the source to the target domain, with French and Serbian data, respectively. Moreover, the accuracy of the base model with the preprocessing step improves by 2% when trained and evaluated on the French dataset. The transfer learning approach with fine-tuning of the pre-trained weights on the French dataset outperformed all other methods, achieving an overall accuracy of 0.94 and a mean class recall of 0.907 on the Serbian dataset. Our partially fine-tuned model improved the recall of crop types that were poorly classified by the base model; in the case of sugar beet, class recall improved by 85.71%.
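As an example of the "additional remote sensing indices" computed during preprocessing, the snippet below derives NDVI from the Sentinel-2 red (B4) and near-infrared (B8) bands; the band arrays are placeholders.

```python
# NDVI from Sentinel-2 bands: (B8 - B4) / (B8 + B4), with a small epsilon for stability.
import numpy as np

def ndvi(nir_b8, red_b4, eps=1e-6):
    nir_b8, red_b4 = np.asarray(nir_b8, float), np.asarray(red_b4, float)
    return (nir_b8 - red_b4) / (nir_b8 + red_b4 + eps)

red = np.array([[0.12, 0.10], [0.25, 0.30]])
nir = np.array([[0.45, 0.50], [0.30, 0.28]])
print(ndvi(nir, red))  # high values indicate dense vegetation
```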

Journal ArticleDOI
TL;DR: In this paper, the authors provide a viewpoint on how to build a system capable of processing big data in real time, performing analysis, and applying algorithms, and they explore current approaches and how they can be used for real-time operations and predictions.
Abstract: Big data has a substantial role nowadays, and its importance has significantly increased over the last decade. Big data's biggest advantages are providing knowledge, supporting the decision-making process, and improving the use of resources, services, and infrastructures. The potential of big data increases when we apply it in real-time by providing real-time analysis, predictions, and forecasts, among many other applications. Our goal with this article is to provide a viewpoint on how to build a system capable of processing big data in real-time, performing analysis, and applying algorithms. A system should be designed to handle vast amounts of data and provide valuable knowledge through analysis and algorithms. This article explores the current approaches and how they can be used for the real-time operations and predictions.

Journal ArticleDOI
TL;DR: In this paper, the authors present several deep learning-based IDSs for detecting DoS attacks in WSNs, including Blackhole, Grayhole, Flooding, and Scheduling attacks.
Abstract: Wireless sensor networks (WSNs) are increasingly being used for data monitoring and collection purposes. Typically, they consist of a large number of sensor nodes that are used remotely to collect data about the activities and conditions of a particular area, for example, temperature, pressure, and motion. Each sensor node is usually small, inexpensive, and relatively easy to deploy compared to other sensing methods. For this reason, WSNs are used in a wide range of applications and industries. However, WSNs are vulnerable to different kinds of security threats and attacks, primarily because they are very limited in resources such as power, storage, bandwidth, and processing power that could otherwise be used to develop their defenses. To ensure their security, an effective intrusion detection system (IDS) needs to be in place to detect these attacks even under these constraints. Today, traditional IDSs are less effective as these malicious attacks become more intelligent, frequent, and complex. Denial of Service (DoS) attacks are one of the main types of attacks that threaten WSNs. For this reason, we review related works that focus on detecting DoS attacks in WSNs. In addition, we developed and implemented several deep learning (DL) based IDSs. These systems were trained on a specialized dataset for WSNs called WSN-DS to detect four types of DoS attacks that affect WSNs: Blackhole, Grayhole, Flooding, and Scheduling attacks. Finally, we evaluated and compared the results, and we discuss possible future work.

Journal ArticleDOI
TL;DR: In this paper, the authors propose a threshold optimization approach that factors in the constraint True Positive Rate (TPR) ≥ True Negative Rate (TNR), which is well suited to addressing class imbalance.
Abstract: Abstract Output thresholding is well-suited for addressing class imbalance, since the technique does not increase dataset size, run the risk of discarding important instances, or modify an existing learner. Through the use of the Credit Card Fraud Detection Dataset, this study proposes a threshold optimization approach that factors in the constraint True Positive Rate (TPR) ≥ True Negative Rate (TNR). Our findings indicate that an increase of the Area Under the Precision–Recall Curve (AUPRC) score is associated with an improvement in threshold-based classification scores, while an increase of positive class prior probability causes optimal thresholds to increase. In addition, we discovered that best overall results for the selection of an optimal threshold are obtained without the use of Random Undersampling (RUS). Furthermore, with the exception of AUPRC, we established that the default threshold yields good performance scores at a balanced class ratio. Our evaluation of four threshold optimization techniques, eight threshold-dependent metrics, and two threshold-agnostic metrics defines the uniqueness of this research.
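A hedged sketch of threshold optimization under the constraint TPR ≥ TNR: candidate thresholds from the ROC curve are filtered by the constraint and ranked by G-mean (the ranking criterion is an assumption); the data is synthetic rather than the Credit Card Fraud Detection Dataset.

```python
# Pick a decision threshold subject to TPR >= TNR, ranking feasible thresholds by G-mean.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

X, y = make_classification(n_samples=20000, weights=[0.98, 0.02], random_state=7)
scores = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

fpr, tpr, thresholds = roc_curve(y, scores)
tnr = 1.0 - fpr
feasible = tpr >= tnr                        # constraint from the paper
gmean = np.sqrt(tpr * tnr)
best = np.argmax(np.where(feasible, gmean, -1.0))
print("best threshold:", thresholds[best], "TPR:", tpr[best], "TNR:", tnr[best])
```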

Journal ArticleDOI
TL;DR: In this paper, a robust visual tracking model using a very deep generator (RTDG) is proposed, integrating a generative adversarial network (GAN) into a CNN to enhance tracking results through an adversarial learning process performed during the training phase.
Abstract: Abstract Deep learning algorithms provide visual tracking robustness at an unprecedented level, but realizing an acceptable performance is still challenging because of the natural continuous changes in the features of foreground and background objects over videos. One of the factors that most affects the robustness of tracking algorithms is the choice of network architecture parameters, especially the depth. A robust visual tracking model using a very deep generator (RTDG) was proposed in this study. We constructed our model on an ordinary convolutional neural network (CNN), which consists of feature extraction and binary classifier networks. We integrated a generative adversarial network (GAN) into the CNN to enhance the tracking results through an adversarial learning process performed during the training phase. We used the discriminator as a classifier and the generator as a store that produces unlabeled feature-level data with different appearances by applying masks to the extracted features. In this study, we investigated the role of increasing the number of fully connected (FC) layers in adversarial generative networks and their impact on robustness. We used a very deep FC network with 22 layers as a high-performance generator for the first time. This generator is used via adversarial learning to augment the positive samples to reduce the gap between the hungry deep learning algorithm and the available training data to achieve robust visual tracking. The experiments showed that the proposed framework performed well against state-of-the-art trackers on OTB-100, VOT2019, LaSOT and UAVDT benchmark datasets.

Journal ArticleDOI
TL;DR: In this paper, the authors provide a comprehensive taxonomy and a bird's-eye view of healthcare KG construction, along with a thorough examination of current state-of-the-art techniques drawn from academic works relevant to various healthcare contexts.
Abstract: Abstract The incorporation of data analytics in the healthcare industry has made significant progress, driven by the demand for efficient and effective big data analytics solutions. Knowledge graphs (KGs) have proven utility in this arena and are rooted in a number of healthcare applications to furnish better data representation and knowledge inference. However, in conjunction with a lack of a representative KG construction taxonomy, several existing approaches in this designated domain are inadequate and inferior. This paper is the first to provide a comprehensive taxonomy and a bird’s eye view of healthcare KG construction. Additionally, a thorough examination of the current state-of-the-art techniques drawn from academic works relevant to various healthcare contexts is carried out. These techniques are critically evaluated in terms of methods used for knowledge extraction, types of the knowledge base and sources, and the incorporated evaluation protocols. Finally, several research findings and existing issues in the literature are reported and discussed, opening horizons for future research in this vibrant area.