
Showing papers in "Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery in 2019"


Journal ArticleDOI
TL;DR: This article provides some necessary definitions to discriminate between explainability and causability as well as a use‐case of DL interpretation and of human explanation in histopathology, and argues that there is a need to go beyond explainable AI.
Abstract: Explainable artificial intelligence (AI) is attracting much interest in medicine. Technically, the problem of explainability is as old as AI itself, and classic AI represented comprehensible, retraceable approaches. However, their weakness was in dealing with uncertainties of the real world. Through the introduction of probabilistic learning, applications became increasingly successful, but increasingly opaque. Explainable AI deals with the implementation of transparency and traceability of statistical black-box machine learning methods, particularly deep learning (DL). We argue that there is a need to go beyond explainable AI. To reach a level of explainable medicine we need causability. In the same way that usability encompasses measurements for the quality of use, causability encompasses measurements for the quality of explanations. In this article, we provide some necessary definitions to discriminate between explainability and causability as well as a use-case of DL interpretation and of human explanation in histopathology. The main contribution of this article is the notion of causability, which is differentiated from explainability in that causability is a property of a person, while explainability is a property of a system. This article is categorized under: Fundamental Concepts of Data and Knowledge > Human Centricity and User Interaction.

723 citations


Journal ArticleDOI
TL;DR: A literature review on the parameters' influence on the prediction performance and on variable importance measures is provided, and the application of one of the most established tuning strategies, model‐based optimization (MBO), is demonstrated.
Abstract: The random forest algorithm (RF) has several hyperparameters that have to be set by the user, e.g., the number of observations drawn randomly for each tree and whether they are drawn with or without replacement, the number of variables drawn randomly for each split, the splitting rule, the minimum number of samples that a node must contain and the number of trees. In this paper, we first provide a literature review on the parameters' influence on the prediction performance and on variable importance measures. It is well known that in most cases RF works reasonably well with the default values of the hyperparameters specified in software packages. Nevertheless, tuning the hyperparameters can improve the performance of RF. In the second part of this paper, after a brief overview of tuning strategies we demonstrate the application of one of the most established tuning strategies, model-based optimization (MBO). To make it easier to use, we provide the tuneRanger R package that tunes RF with MBO automatically. In a benchmark study on several datasets, we compare the prediction performance and runtime of tuneRanger with other tuning implementations in R and RF with default hyperparameters.

559 citations
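The hyperparameters named in the abstract exist in most RF implementations; tuneRanger itself is an R package, so the following is only a rough Python sketch in which plain random search (via scikit-learn) stands in for model-based optimization, and the dataset is synthetic:

```python
# Hypothetical sketch: tuning the main RF hyperparameters from the abstract
# with random search (a simpler stand-in for model-based optimization).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

param_space = {
    "n_estimators": [100, 300, 500],    # number of trees
    "max_features": [2, 3, "sqrt"],     # variables drawn randomly per split
    "min_samples_leaf": [1, 5, 10],     # minimum number of samples per node
    "bootstrap": [True, False],         # observations drawn with/without replacement
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_space, n_iter=10, cv=3, random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

In practice, as the abstract notes, the defaults often work reasonably well, so tuning budgets like the small `n_iter` here are a trade-off between runtime and expected gain.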


Journal ArticleDOI
TL;DR: This work analyzes how frequent itemset mining has been addressed over the last decades, considering centralized systems as well as parallel (shared- or nonshared-memory) architectures; the solutions can be divided into exhaustive-search and nonexhaustive-search models.
Abstract: Frequent itemset mining (FIM) is an essential task within data analysis since it is responsible for extracting frequently occurring events, patterns, or items in data. Insights from such pattern analysis offer important benefits in decision-making processes. However, algorithmic solutions for mining such kinds of patterns are not straightforward since the computational complexity increases exponentially with the number of items in the data. This issue, together with the significant memory consumption present in the mining process, makes it necessary to propose extremely efficient solutions. Since the FIM problem was first described in the early 1990s, multiple solutions have been proposed, considering centralized systems as well as parallel (shared- or nonshared-memory) architectures. Solutions can also be divided into exhaustive-search and nonexhaustive-search models. Many such approaches are extensions of other solutions, and it is therefore necessary to analyze how this task has been considered during the last decades.

122 citations
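A minimal exhaustive-search sketch of the task the abstract describes (Apriori-style level-wise candidate generation; real miners add pruning and the memory optimizations the survey covers):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Minimal exhaustive-search frequent itemset miner (Apriori-style sketch)."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    frequent, k = {}, 1
    candidates = [frozenset([i]) for i in items]
    while candidates:
        # Count support of each candidate itemset over all transactions
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets into candidate (k+1)-itemsets
        candidates = list({a | b for a, b in combinations(level, 2)
                           if len(a | b) == k + 1})
        k += 1
    return frequent

txns = [frozenset(t) for t in (["a", "b", "c"], ["a", "b"], ["a", "c"],
                               ["b", "c"], ["a", "b", "c"])]
freq = apriori(txns, min_support=0.6)
print(len(freq))  # six frequent itemsets at 60% support
```

The exponential blow-up the abstract mentions is visible here: without the support threshold, the candidate space is the power set of the items.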


Journal ArticleDOI
TL;DR: This work presents a systematic overview of the current status of the Educational Text Mining field, answering three main research questions about the text mining techniques most used in educational environments, the most used educational resources, and the main applications or educational goals.
Abstract: The explosive growth of online education environments is generating a massive volume of data, especially in text format from forums, chats, social networks, assessments, essays, and other sources. This produces exciting challenges on how to mine text data in order to find useful knowledge for educational stakeholders. Despite the increasing number of educational applications of text mining published recently, we have not found any paper surveying them. Along these lines, this work presents a systematic overview of the current status of the Educational Text Mining field. Our final goal is to answer three main research questions: Which are the text mining techniques most used in educational environments? Which are the most used educational resources? And which are the main applications or educational goals? Finally, we outline the conclusions and the most interesting future trends.

98 citations


Journal ArticleDOI
TL;DR: A thorough experimental analysis in a series of big datasets is carried out that provides guidelines as to how to use the k‐nearest neighbor algorithm to obtain Smart/Quality Data for a high‐quality data mining process.
Abstract: The k-nearest neighbours algorithm is characterised as a simple yet effective data mining technique. The main drawback of this technique appears when massive amounts of data, likely to contain noise and imperfections, are involved, turning this algorithm into an imprecise and especially inefficient technique. These disadvantages have been the subject of research for many years, and among other approaches, data preprocessing techniques such as instance reduction or missing-values imputation have targeted these weaknesses. As a result, these issues have turned into strengths and the k-nearest neighbours rule has become a core algorithm to identify and correct imperfect data, removing noisy and redundant samples, or imputing missing values, transforming Big Data into Smart Data, which is data of sufficient quality to expect a good outcome from any data mining algorithm. The role of this smart data gleaning algorithm in a supervised learning context is investigated here. This includes a brief overview of Smart Data, current and future trends for the k-nearest neighbours algorithm in the Big Data context, and the existing data preprocessing techniques based on this algorithm. We present the emerging big data-ready versions of these algorithms and develop some new methods to cope with Big Data. We carry out a thorough experimental analysis on a series of big datasets that provides guidelines as to how to use the k-nearest neighbours algorithm to obtain Smart/Quality Data for a high-quality data mining process. Moreover, multiple Spark Packages have been developed, including all the Smart Data algorithms analysed.

89 citations
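One of the preprocessing roles described above, missing-value imputation with nearest neighbours, can be sketched with scikit-learn (a small in-memory illustration, not the Spark-scale versions the abstract refers to):

```python
# Sketch: k-nearest-neighbours imputation of a missing value.
# The missing entry is replaced by the mean of that feature over the
# k rows closest to the incomplete row (distances ignore missing entries).
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [2.0, np.nan],   # missing value to impute
              [3.0, 4.0],
              [8.0, 8.0]])
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled[1, 1])  # mean of the two nearest donors' second feature: 3.0
```

The same neighbourhood idea underlies the noise-filtering and instance-reduction techniques surveyed in the article: samples whose neighbours disagree with their label are flagged as noisy or redundant.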


Journal ArticleDOI
TL;DR: There have been various NN‐based approaches proposed for short‐term traffic state prediction that are surveyed in this article, where the existing NN models are classified and their application to this area is reviewed.
Abstract: Traffic state prediction is a key component in intelligent transport systems (ITS) and has attracted much attention over the last few decades. Advances in computational power and availability of a large amount of data have paved the way to employ advanced neural network (NN) models for ITS, including deep architectures. There have been various NN‐based approaches proposed for short‐term traffic state prediction that are surveyed in this article, where the existing NN models are classified and their application to this area is reviewed. An in‐depth discussion is provided to demonstrate how different types of NNs have been used for different aspects of short‐term traffic state prediction. Finally, possible further research directions are suggested for additional applications of NN models, especially using deep architectures, to address the dynamic nature in complex transportation networks.

73 citations


Journal ArticleDOI
TL;DR: In this paper, the authors present a broad range of computational techniques to improve applicability, run time, and memory management of automatic differentiation packages, including operation overloading, region based memory, and expression templates.
Abstract: Derivatives play a critical role in computational statistics, examples being Bayesian inference using Hamiltonian Monte Carlo sampling and the training of neural networks. Automatic differentiation is a powerful tool to automate the calculation of derivatives and is preferable to more traditional methods, especially when differentiating complex algorithms and mathematical functions. The implementation of automatic differentiation, however, requires some care to ensure efficiency. Modern differentiation packages deploy a broad range of computational techniques to improve applicability, run time, and memory management. Among these techniques are operation overloading, region-based memory, and expression templates. There also exist several mathematical techniques which can yield high performance gains when applied to complex algorithms. For example, semi-analytical derivatives can reduce by orders of magnitude the runtime required to numerically solve and differentiate an algebraic equation. Open problems include the extension of current packages to provide more specialized routines, and efficient methods to perform higher-order differentiation.

68 citations
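The operation-overloading technique the abstract names can be sketched with forward-mode dual numbers: each value carries its derivative alongside it, and the overloaded arithmetic propagates both. This is a toy illustration of the idea, not how production AD packages are engineered:

```python
class Dual:
    """Forward-mode automatic differentiation via dual numbers."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot   # value and its derivative
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__
    def __mul__(self, other):           # product rule
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)
    __rmul__ = __mul__

def derivative(f, x):
    """Evaluate df/dx by seeding the input's derivative with 1."""
    return f(Dual(x, 1.0)).dot

# d/dx (x^2 + 3x) at x = 2 is 2x + 3 = 7
print(derivative(lambda x: x * x + 3 * x, 2.0))
```

Reverse-mode AD, the variant used for neural-network training, instead records the computation and propagates derivatives backwards, which is where the memory-management techniques discussed in the article become critical.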


Journal ArticleDOI
TL;DR: A systematic review of the literature on smart city big data analytics, a technological and thematic analysis of the shortlisted literature, and a classification model that studies four aspects of research in this domain are presented.
Abstract: With the increasing role of ICT in enabling and supporting smart cities, the demand for big data analytics solutions is increasing. Various artificial intelligence, data mining, machine learning and statistical analysis-based solutions have been successfully applied in thematic domains like climate science, energy management, transport, air quality management and weather pattern analysis. In this paper, we present a systematic review of the literature on smart city big data analytics. We have searched a number of different repositories using specific keywords and followed a structured data mining methodology for selecting material for the review. We have also performed a technological and thematic analysis of the shortlisted literature, identified various data mining/machine learning techniques and presented the results. Based on this analysis we also present a classification model that studies four aspects of research in this domain. These include data models, computing models, security and privacy aspects and major market drivers in the smart cities domain. Moreover, we present a gap analysis and identify future directions for research. For the thematic analysis we identified the themes smart city governance, economy, environment, transport and energy. We present the major challenges in these themes, the major research work done in the field of data analytics to address these challenges and future research directions. This article is categorized under: Application Areas > Government and Public Sector Fundamental Concepts of Data and Knowledge > Big Data Mining.

66 citations


Journal ArticleDOI
TL;DR: This advanced review describes the historical profile of the shallow feature learning research and introduces the important developments of the deep learning models, and surveys the deep architectures with benefits from the optimization of their width and depth.
Abstract: Since Pearson developed principal component analysis (PCA) in 1901, feature learning (also called representation learning) has been studied for more than 100 years. During this period, many “shallow” feature learning methods were proposed based on various learning criteria and techniques, until the rise of deep learning research in recent years. In this advanced review, we describe the historical profile of shallow feature learning research and introduce the important developments of deep learning models. Particularly, we survey deep architectures that benefit from the optimization of their width and depth, as these models have achieved new records in many applications, such as image classification and object detection. Finally, several interesting directions of deep learning are presented and briefly discussed.

63 citations
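Since the review traces feature learning back to Pearson's PCA, a minimal sketch of PCA via SVD of the centered data matrix may help fix ideas (synthetic data, for illustration only):

```python
import numpy as np

def pca(X, k):
    """Project X onto its top-k principal components (PCA via SVD)."""
    Xc = X - X.mean(axis=0)                       # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                          # scores in the top-k subspace

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Z = pca(X, 2)
print(Z.shape)  # (100, 2)
```

The resulting score columns are uncorrelated by construction, which is the "learning criterion" (maximal variance under orthogonality) that later shallow and deep methods generalize.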


Journal ArticleDOI
TL;DR: In this review, several areas of cybersecurity where machine learning is used as a tool are discussed and a few glimpses of adversarial attacks on machine learning algorithms to manipulate training and test data of classifiers, to render such tools ineffective are provided.
Abstract: Machine learning technology has become mainstream in a large number of domains, and cybersecurity applications of machine learning techniques are plentiful. Examples include malware analysis, especially for zero-day malware detection, threat analysis, anomaly-based intrusion detection of prevalent attacks on critical infrastructures, and many others. Due to the ineffectiveness of signature-based methods in detecting zero-day attacks or even slight variants of known attacks, machine learning-based detection is being used by researchers in many cybersecurity products. In this review, we discuss several areas of cybersecurity where machine learning is used as a tool. We also provide a few glimpses of adversarial attacks on machine learning algorithms that manipulate training and test data of classifiers in order to render such tools ineffective.

62 citations


Journal ArticleDOI
TL;DR: The main objective of this study is to provide a comprehensive survey of error measures for evaluating the outcome of binary decision making applicable to many data‐driven fields.
Abstract: Binary decision making is a topic of great interest for many fields, including biomedical science, economics, management, politics, medicine, natural science and social science, and much effort has been spent for developing novel computational methods to address problems arising in the aforementioned fields. However, in order to evaluate the effectiveness of any prediction method for binary decision making, the choice of the most appropriate error measures is of paramount importance. Due to the variety of error measures available, the evaluation process of binary decision making can be a complex task. The main objective of this study is to provide a comprehensive survey of error measures for evaluating the outcome of binary decision making applicable to many data-driven fields. This article is categorized under: Fundamental Concepts of Data and Knowledge > Key Design Issues in Data Mining; Technologies > Prediction; Algorithmic Development > Statistics.
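As a small illustration of the survey's subject matter, here are a few widely used error measures derived from a 2x2 confusion matrix (this is not the survey's full catalogue, just a representative sketch):

```python
def binary_measures(tp, fp, fn, tn):
    """Common error measures derived from a 2x2 confusion matrix."""
    sens = tp / (tp + fn)                  # sensitivity / recall
    spec = tn / (tn + fp)                  # specificity
    prec = tp / (tp + fp)                  # precision
    f1 = 2 * prec * sens / (prec + sens)   # harmonic mean of precision/recall
    mcc_num = tp * tn - fp * fn            # Matthews correlation coefficient
    mcc_den = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    return {"sensitivity": sens, "specificity": spec,
            "precision": prec, "F1": f1, "MCC": mcc_num / mcc_den}

m = binary_measures(tp=40, fp=10, fn=5, tn=45)
print({k: round(v, 3) for k, v in m.items()})
```

The variety of measures is exactly the complexity the abstract points to: F1 ignores true negatives while MCC uses all four cells, so the two can rank the same classifiers differently on imbalanced data.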

Journal ArticleDOI
TL;DR: This work identifies key parametric attributes for assessing clustering algorithms, which in turn benefit existing work and pave the way for profound future research in this realm.
Abstract: Data mining is an essential task in most emerging computing technologies, as it reduces the complexity of datasets by providing better insight into them. It enables the analysis of vast and heterogeneous datasets and thus extracts substantial knowledge from the abundance of data through the pragmatic application of suitable algorithms. There are many algorithms in the literature for this purpose. Clustering is among the most widely used techniques for analyzing data within the purview of data mining, which motivated the authors to survey the existing literature on this topic rigorously. We have consequently identified various key parameters so that improvements become possible when selecting the clustering algorithm that best fits a specific problem domain. Because clustering, classification, and association rule mining are closely related and indispensable to data mining, we also cover the interrelations among these tasks, so that this work will be of help to researchers working in this field. The present study also examines the challenges associated with clustering algorithms for two- and high-dimensional databases. Over and above, this work identifies key parametric attributes for assessing clustering algorithms, which in turn benefit existing work and pave the way for profound future research in this realm.

Journal ArticleDOI
TL;DR: Applying a game-theoretic approach, robust learning techniques have been developed to specifically address adversarial attacks, and the preliminary results are promising.
Abstract: The field of machine learning is progressing at a faster pace than ever before. Many organizations leverage machine learning tools to extract useful information from a massive amount of data. In particular, machine learning finds its application in cybersecurity, which begins to enter the age of automation. However, machine learning applications in cybersecurity face unique challenges that other domains rarely do: attacks from active adversaries. Problems in areas such as intrusion detection, banking fraud detection, spam filtering, and malware detection have to face adversarial attacks that modify data so that malicious instances evade detection by the learning systems. The adversarial learning problem naturally resembles a game between the learning system and the adversary. In such a game, both players attempt to play their best strategies against each other while maximizing their own payoffs. To solve the game, each player searches for an optimal strategy against the opponent based on the prediction of the opponent's strategy choice. The problem becomes even more complicated in settings where the learning system may have to deal with many adversaries of unknown types. Applying a game-theoretic approach, robust learning techniques have been developed to specifically address adversarial attacks, and the preliminary results are promising. In this review, we summarize these results.

Journal ArticleDOI
TL;DR: This work reviews 13 subgroup identification methods and uses real‐world and simulated data to compare the performance of their publicly available software using seven criteria, showing that many methods fare poorly on at least one criterion.
Abstract: Natural heterogeneity in patient populations can make it very hard to develop treatments that benefit all patients. As a result, an important goal of precision medicine is identification of patient subgroups that respond to treatment at a much higher (or lower) rate than the population average. Despite there being many subgroup identification methods, there is no comprehensive comparative study of their statistical properties. We review 13 methods and use real‐world and simulated data to compare the performance of their publicly available software using seven criteria: (a) bias in selection of subgroup variables, (b) probability of false discovery, (c) probability of identifying correct predictive variables, (d) bias in estimates of subgroup treatment effects, (e) expected subgroup size, (f) expected true treatment effect of subgroups, and (g) subgroup stability. The results show that many methods fare poorly on at least one criterion.

Journal ArticleDOI
TL;DR: Comparison results indicate that SRmining, PMES, Ant‐ARM, and MDS‐H are the fastest heuristic ARM algorithms, and HSBO‐TS is the most complete one, while SRmining and ACS require only one database scan.
Abstract: Association rule mining (ARM) is a commonly encountered data mining method. There are many approaches to mining frequent rules and patterns from a database, and one among them is heuristics. Many heuristic approaches have been proposed but, to the best of our knowledge, there is no comprehensive literature review of such approaches, only limited attempts. This gap needs to be filled. This paper reviews heuristic approaches to ARM and points out their most significant strengths and weaknesses. We propose eight performance metrics, such as execution time, memory consumption, completeness, and interestingness; we compare the approaches against these performance metrics and discuss our findings. For instance, comparison results indicate that SRmining, PMES, Ant-ARM, and MDS-H are the fastest heuristic ARM algorithms. HSBO-TS is the most complete one, while SRmining and ACS require only one database scan. In addition, we propose a parameter, named GT-Rank, for ranking heuristic ARM approaches, and based on it, ARMGA, ASC, and Kua emerge as the best approaches. We also treat ARM algorithms and their characteristics as transactions and items in a transactional database, respectively, and generate association rules that indicate research trends in this area.
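To make the rule-generation step concrete, here is a minimal sketch that derives association rules from precomputed frequent-itemset supports using a confidence threshold (the heuristic search strategies that are the paper's actual focus are omitted; the example supports are made up):

```python
from itertools import combinations

def rules(frequent, min_conf):
    """Derive rules A -> B from a dict of frequent itemsets to supports."""
    out = []
    for itemset, supp in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for ante in combinations(sorted(itemset), r):
                ante = frozenset(ante)
                conf = supp / frequent[ante]   # conf(A->B) = supp(AB)/supp(A)
                if conf >= min_conf:
                    out.append((ante, itemset - ante, conf))
    return out

# Hypothetical supports: supp(a)=0.8, supp(b)=0.6, supp(ab)=0.6
freq = {frozenset("a"): 0.8, frozenset("b"): 0.6, frozenset("ab"): 0.6}
print(rules(freq, min_conf=0.9))  # only b -> a survives (confidence 1.0)
```

Metrics like the paper's "interestingness" refine this: confidence alone can flag rules whose consequent is simply common, which measures such as lift correct for.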

Journal ArticleDOI
TL;DR: In this review, more than 90 relevant research studies have been analyzed, describing the most important practical applications, terminological resources, tools, and open challenges of TM in medicine.
Abstract: Health care professionals produce abundant textual information in their daily clinical practice and this information is stored in many diverse sources and, generally, in textual form. The extraction of insights from all the gathered information, mainly unstructured and lacking normalization, is one of the major challenges in computational medicine. In this respect, text mining (TM) assembles different techniques to derive valuable insights from unstructured textual data so it has led to be especially relevant in medicine. The aim of this paper is therefore to provide an extensive review of existing techniques and resources to perform TM tasks in medicine. In this review, more than 90 relevant research studies have been analyzed, describing the most important practical applications, terminological resources, tools, and open challenges of TM in medicine.

Journal ArticleDOI
TL;DR: The sheer amount of data stemming from devices forming the IoT requires new data mining systems and techniques, which are discussed and categorized later in this paper.
Abstract: The Internet of Things (IoT) is the result of the convergence of sensing, computing, and networking technologies, allowing devices of varying sizes and computational capabilities (things) to intercommunicate. This communication can be achieved locally, enabling what is known as edge and fog computing, or through the well-established Internet infrastructure, exploiting the computational resources in the cloud. The IoT paradigm enables a new breed of applications in various areas including health care, energy management, and smart cities. This paper starts off by reviewing these applications and their potential benefits. Challenges facing the realization of such applications are then discussed. The sheer amount of data stemming from devices forming the IoT requires new data mining systems and techniques, which are discussed and categorized later in this paper. Finally, the paper concludes with future research directions.

Journal ArticleDOI
TL;DR: A good overview of the status quo in benchmarking in classification and nonlinear regression can be found in this article, where the authors present guidelines and best practices for benchmarking and discuss performance metrics for a sound statistical comparative analysis.
Abstract: The article presents an overview of the status quo in benchmarking in classification and nonlinear regression. It outlines guidelines and best practices for comparative analysis in machine learning, covering benchmarking principles, accuracy estimation, and model validation, and it discusses performance metrics for a sound statistical comparative analysis. It also provides references to established repositories and competitions and discusses the objectives and limitations of benchmarking. Benchmarking is key to progress in machine learning, as it allows an unprejudiced comparison among alternative methods.
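A minimal illustration of one benchmarking principle the article discusses, evaluating competing methods on identical resampling splits so their scores are directly comparable (the dataset and models here are chosen only for illustration, via scikit-learn):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
# Fixing the CV splitter's seed gives every method the same train/test folds.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
models = [
    ("logreg", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ("tree", DecisionTreeClassifier(random_state=0)),
]
for name, model in models:
    scores = cross_val_score(model, X, y, cv=cv)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Paired per-fold scores like these are what a sound statistical comparison (e.g., a paired test over folds) operates on, rather than a single accuracy number per method.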

Journal ArticleDOI
TL;DR: This paper first constructs an ethnical group face dataset including Chinese Uyghur, Tibetan, and Korean, and constructs three “T” regions in a face image for ethnical feature representation and proves them to be effective areas for ethnicity recognition.
Abstract: The salient facial feature discovery is one of the important research tasks in ethnical group face recognition. In this paper, we first construct an ethnical group face dataset including Chinese Uyghur, Tibetan, and Korean subjects. Then, we show that the effective sparse sensing approach to general face recognition no longer works for ethnical group facial recognition if features based on the whole face image are used. This is partially due to the fact that each ethnical group may have its own characteristics manifesting only in specific face regions. Therefore, we analyze the particularity of the three ethnical groups and aim to find common characterizations in some local regions for them. For this purpose, we first use the facial landmark detector STASM to find some important landmarks in a face image; then, we use the well-known data mining technique, the mRMR algorithm, to select the salient geometric length features based on all possible lines connected by any two landmarks. Second, based on these selected salient features, we construct three “T” regions in a face image for ethnical feature representation and prove them to be effective areas for ethnicity recognition. Finally, some extensive experiments are conducted and the results reveal that the proposed “T” regions with extracted features are quite effective for ethnical group facial recognition when the L2-norm is adopted using the sparse sensing approach. In comparison to face recognition, the proposed three “T” regions are evaluated on the Olivetti Research Laboratory (ORL) face dataset, and the results show that the “T” regions constructed for ethnicity recognition are not suitable for general face recognition.

Journal ArticleDOI
TL;DR: The state-of-the-art event-based vision algorithms are surveyed in this article and categorized into three major vision applications: object detection/recognition, object tracking, and localization and mapping.
Abstract: Regardless of the marvels brought by conventional frame-based cameras, they have significant drawbacks due to their redundancy in data and temporal latency. This causes problems in applications where low-latency transmission and high-speed processing are mandatory. Proceeding along this line of thought, the neurobiological principles of the biological retina have been adapted to accomplish data sparsity and high dynamic range at the pixel level. These bio-inspired neuromorphic vision sensors alleviate the serious bottleneck of data redundancy by responding to changes in illumination rather than to illumination itself. This paper briefly reviews one such representative of neuromorphic sensors, the activity-driven event-based vision sensor, which mimics human eyes. Spatio-temporal encoding of event data permits the incorporation of time correlation in addition to spatial correlation in vision processing, which enables more robustness. Hence, conventional vision algorithms have to be reformulated to adapt to this new generation of vision sensor data. This involves designing algorithms for sparse, asynchronous, and accurately timed information. Theories and new research have begun emerging recently in the domain of event-based vision, and compiling the vision research carried out in this sensor domain has become essential. Towards this, this paper reviews the state-of-the-art event-based vision algorithms by categorizing them into three major vision applications: object detection/recognition, object tracking, and localization and mapping. This article is categorized under: Technologies > Machine Learning

Journal ArticleDOI
TL;DR: This paper describes and evaluates the following popular Big Data processing tools: Drill, HAWQ, Hive, Impala, Presto, and Spark, and highlights the performance of each tool, according to different workloads and query types.
Abstract: FCT – Fundacao para a Ciencia e Tecnologia, Grant/Award Number: UID/CEC/00319/2013; COMPETE, Grant/Award Number: POCI01-0145-FEDER-007043

Journal ArticleDOI
TL;DR: In this article, the authors present approaches for model-based clustering and classification of functional data, and derive well-established statistical models along with efficient algorithmic tools to address problems regarding the clustering, missing information, and dynamical hidden structure.
Abstract: The problem of complex data analysis is a central topic of modern statistical science and learning systems and is becoming of broader interest with the increasing prevalence of high-dimensional data. The challenge is to develop statistical models and autonomous algorithms that are able to acquire knowledge from raw data for exploratory analysis, which can be achieved through clustering techniques, or to make predictions of future data via classification (i.e., discriminant analysis) techniques. Latent data models, including mixture model-based approaches, are among the most popular and successful approaches in both the unsupervised context (i.e., clustering) and the supervised one (i.e., classification or discrimination). Although traditionally tools of multivariate analysis, they are growing in popularity when considered in the framework of functional data analysis (FDA). FDA is the data analysis paradigm in which the individual data units are functions (e.g., curves, surfaces), rather than simple vectors. In many areas of application, the analyzed data are indeed often available in the form of discretized values of functions or curves (e.g., time series, waveforms) and surfaces (e.g., 2d-images, spatio-temporal data). This functional aspect of the data adds additional difficulties compared to the case of a classical multivariate (non-functional) data analysis. We review and present approaches for model-based clustering and classification of functional data. We derive well-established statistical models along with efficient algorithmic tools to address problems regarding the clustering and the classification of these high-dimensional data, including their heterogeneity, missing information, and dynamical hidden structure. The presented models and algorithms are illustrated on real-world functional data analysis problems from several application areas.
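A toy sketch of model-based clustering of discretized curves with a Gaussian mixture (scikit-learn's GaussianMixture on curves sampled on a common grid; the article's functional models are richer, this only illustrates the mixture idea on simulated data):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Two groups of noisy curves observed on a common grid of 20 time points,
# i.e., the discretized FDA setting described in the abstract.
t = np.linspace(0, 1, 20)
curves = np.vstack([
    np.sin(2 * np.pi * t) + 0.1 * rng.normal(size=(30, 20)),
    np.cos(2 * np.pi * t) + 0.1 * rng.normal(size=(30, 20)),
])
# Model-based clustering: fit a 2-component mixture and read off the
# maximum-a-posteriori component of each curve.
gmm = GaussianMixture(n_components=2, covariance_type="diag", random_state=0)
labels = gmm.fit_predict(curves)
print(labels[:5], labels[-5:])
```

Treating each discretized curve as a plain vector, as here, is exactly the multivariate shortcut the abstract warns about; functional approaches instead model the curves through basis expansions or latent dynamics.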

Journal ArticleDOI
TL;DR: This survey studies EFA applied to data mining, focusing on the problem of establishing the optimal number of factors to be retained; the main focus is on the most frequently applied factor selection methods, namely the Kaiser Criterion, Cattell's Scree test, and Monte Carlo Parallel Analysis.
Abstract: In many types of research, including studies in the agricultural and plant sciences, large quantities of data are frequently obtained that must be analyzed using different data mining techniques. Sometimes data mining involves the application of different methods of statistical data analysis. Exploratory Factor Analysis (EFA) is frequently used as a technique for data reduction and structure detection in data mining. In our survey, we study EFA applied to data mining, focusing on the problem of establishing the optimal number of factors to be retained. The number of factors to retain is the most important decision to be taken after factor extraction in EFA. Many researchers have discussed the criteria for choosing the optimal number of factors. Mistakes in factor extraction may consist of extracting too few or too many factors, and an inappropriate number of factors may lead to erroneous conclusions. A comprehensive review of the state-of-the-art related to this subject was made. The main focus was on the most frequently applied factor selection methods, namely the Kaiser Criterion, Cattell's Scree test, and Monte Carlo Parallel Analysis. We have highlighted the importance, depending on the specifics of the research, of analyzing the total cumulative variance explained by the selected optimal number of extracted factors; the extracted factors should explain at least a minimum threshold of cumulative variance. The ExtrOptFact algorithm presents the steps that must be performed in EFA for the selection of the optimal number of factors. For validation purposes, a case study was presented, performed on data obtained in an experimental study that we made on the Brassica napus plant. Applying the ExtrOptFact algorithm to Principal Component Analysis led to the selection of three components, called Qualitative, Generative, and Vegetative, which together explained 92% of the total cumulative variance.
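The two retention rules the survey emphasizes can be stated very compactly. In the hedged sketch below, the observed eigenvalues are assumed to come from an EFA of a correlation matrix computed elsewhere, and the simulated eigenvalue means stand in for the Monte Carlo replicates that parallel analysis would generate; all numbers are illustrative.

```python
def kaiser(eigenvalues):
    """Kaiser criterion: retain factors with eigenvalue > 1."""
    return sum(1 for ev in eigenvalues if ev > 1.0)

def parallel_analysis(observed, simulated_mean):
    """Retain factors whose observed eigenvalue exceeds the mean
    eigenvalue of random data at the same rank."""
    count = 0
    for obs, sim in zip(observed, simulated_mean):
        if obs > sim:
            count += 1
        else:
            break  # stop at the first factor that fails the comparison
    return count

observed = [3.2, 1.9, 1.1, 0.6, 0.2]        # illustrative eigenvalues
random_mean = [1.4, 1.2, 1.15, 0.95, 0.8]   # illustrative Monte Carlo means

k_kaiser = kaiser(observed)
k_pa = parallel_analysis(observed, random_mean)
```

With these illustrative numbers the Kaiser criterion retains three factors while parallel analysis retains only two, showing why the two rules can disagree and why the survey compares them.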

Journal ArticleDOI
TL;DR: This study proposes an objective risk measurement method for the lending process of SME commercial corporate customers and performs a data mining classification task using current customer data from a bank's credit evaluation process.
Abstract: The constant need to assess loans makes risk evaluation a very important problem for the banking sector. A crucial function of banks is to fund households and companies from various industries in the economy, and a bank takes on risk as soon as a loan is given to an entity. Currently, banks employ sector- and experience-based methods of analysis to estimate the risks to be taken. For the credit process, there exist a large number of studies in the literature on scoring individual clients, but there are very few studies on scoring small and medium enterprise (SME) commercial corporate customers. In this study, we propose an objective risk measurement method for the lending process of SME commercial corporate customers and perform a data mining classification task using current customer data from a bank's credit evaluation process. For this purpose, we first create a risk measure by looking into the risks identified for existing customers by the analysts of a bank. These scores are used as the target variable in the classification process. Then, we extract rules for estimating these scores using the Weka software. We used six different algorithms and compared the results in terms of test accuracy, number of rules, recall, precision, and the Kappa statistic. Our approach obtained high accuracy rates on real-life data. As a result, we showed that an objective evaluation strategy can be used in the lending process for SME commercial corporate customers in the banking system using data mining.
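The evaluation measures the study compares (test accuracy, recall, precision, and the Kappa statistic) can all be derived from a confusion matrix. The sketch below computes them for a binary case; the matrix counts are made up for illustration and are not from the study's data.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, recall, precision, and Cohen's kappa from a
    binary confusion matrix (tp/fp/fn/tn counts)."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    # Cohen's kappa: agreement beyond what chance alone would give.
    p_obs = accuracy
    p_yes = ((tp + fn) / n) * ((tp + fp) / n)   # chance agreement on "yes"
    p_no = ((fp + tn) / n) * ((fn + tn) / n)    # chance agreement on "no"
    p_exp = p_yes + p_no
    kappa = (p_obs - p_exp) / (1 - p_exp)
    return accuracy, recall, precision, kappa

# Illustrative confusion matrix: 40 true positives, 10 false positives,
# 5 false negatives, 45 true negatives.
acc, rec, prec, kap = metrics(tp=40, fp=10, fn=5, tn=45)
```

Kappa is often preferred over raw accuracy for credit-risk classes because it discounts agreement that would occur by chance under the class distribution.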

Journal ArticleDOI
TL;DR: Analysis of tweets during a service disruption of a leading Australian organization, used as a case study, found that sarcastic expressions were more frequent during the service disruption than on regular days, and that negative sarcastic tweets attracted significantly higher social media responses than literal negative expressions.
Abstract: Sarcasm in verbal and nonverbal communication is known to attract higher attention and create deeper influence than other negative responses. Many people are adept at including sarcasm in written communication, and sarcastic comments thus have the potential to stimulate the virality of social media content. Although diverse computational approaches have been used to detect sarcasm in social media, the use of text mining to explore the influential role of sarcasm in spreading negative content is limited. Using tweets during a service disruption of a leading Australian organization as a case study, we explore this phenomenon using a text mining framework with a combination of statistical modeling and natural language processing (NLP) techniques. Our work targets two main outcomes: the quantification of the influence of sarcasm and the exploration of the change in topical relationships in the conversations over time. We found that sarcastic expressions were more frequent during the service disruption than on regular days, and that negative sarcastic tweets attracted significantly higher social media responses than literal negative expressions. The content analysis showed that consumers initially complaining sarcastically about the outage tended to eventually widen the negative sarcasm, in a cascading effect, towards the organization's internal issues and strategies. Organizations could utilize such insights to enable proactive decision-making during crisis situations. Moreover, detailed exploration of these impacts would elevate current text mining applications toward a better understanding of the impact of sarcasm expressed by stakeholders in a social media environment, which can significantly affect the reputation and goodwill of an organization.
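The quantification step the abstract describes boils down to comparing the response volume of tweets flagged as sarcastic against literal negative tweets. The sketch below is illustrative only: the labels and response counts are invented, and in the actual study the `sarcastic` flag would come from an NLP sarcasm classifier and the comparison would be backed by a statistical test rather than a raw ratio.

```python
def mean(xs):
    return sum(xs) / len(xs)

# Invented mini-corpus: each record is a tweet's sarcasm label and its
# total social media responses (e.g., retweets plus replies).
tweets = [
    {"sarcastic": True,  "responses": 34},
    {"sarcastic": True,  "responses": 21},
    {"sarcastic": False, "responses": 5},
    {"sarcastic": False, "responses": 9},
    {"sarcastic": False, "responses": 4},
]

sarcastic = [t["responses"] for t in tweets if t["sarcastic"]]
literal = [t["responses"] for t in tweets if not t["sarcastic"]]
ratio = mean(sarcastic) / mean(literal)  # > 1: sarcasm draws more responses
```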

Journal ArticleDOI
TL;DR: A new research topic is introduced—transformative knowledge discovery—that provides a research ground to study and develop smart machine learning models and algorithms that are automatic, adaptive, and cognitive to address big data analytics problems and challenges.
Abstract: Big data analytics provides an interdisciplinary framework that is essential to support the current trend for solving real-world problems collaboratively. The progression of the big data analytics framework must be clearly understood so that novel approaches can be developed to advance this state-of-the-art discipline. Ignoring the progression of this fast-growing discipline may lead to duplicated research and wasted effort. Its main companion field, machine learning, helps solve many big data analytics problems; therefore, it is also important to understand the progression of machine learning in the big data analytics framework. One of the current research efforts in big data analytics is the integration of deep learning and Bayesian optimization, which can help the automatic initialization and optimization of the hyperparameters of deep learning and enhance the implementation of iterative algorithms in software. The hyperparameters include the weights used in deep learning and the number of clusters in Bayesian mixture models that characterize data heterogeneity. Big data analytics research also requires computer systems and software that are capable of storing, retrieving, processing, and analyzing big data that are generally large, complex, heterogeneous, unstructured, unpredictable, and exposed to scalability problems. Therefore, it is appropriate to introduce a new research topic—transformative knowledge discovery—that provides a research ground to study and develop smart machine learning models and algorithms that are automatic, adaptive, and cognitive to address big data analytics problems and challenges. The new research domain will also create research opportunities to work in this interdisciplinary research space and develop solutions to support research in other disciplines that may not have expertise in big data analytics.
For example, research such as the detection and characterization of retinal diseases in the medical sciences, or the classification of highly interacting species in the environmental sciences, can benefit from the knowledge and expertise in big data analytics.
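The hyperparameter-optimization idea mentioned in the abstract can be illustrated without the machinery of a full Bayesian optimizer. The sketch below deliberately substitutes random search, a common surrogate-free baseline, for Bayesian optimization (which would add a probabilistic surrogate model and an acquisition function on top of this loop); the toy "validation loss" and its optimum at 0.1 are invented for illustration.

```python
import random

def validation_loss(lr):
    """Toy, stand-in validation loss with a minimum at lr = 0.1."""
    return (lr - 0.1) ** 2 + 0.01

def random_search(n_trials, seed=0):
    """Keep the best of n_trials uniformly sampled learning rates."""
    rng = random.Random(seed)
    best_lr, best_loss = None, float("inf")
    for _ in range(n_trials):
        lr = rng.uniform(0.0, 1.0)      # sample a candidate hyperparameter
        loss = validation_loss(lr)
        if loss < best_loss:
            best_lr, best_loss = lr, loss
    return best_lr, best_loss

best_lr, best_loss = random_search(200)
```

Bayesian optimization improves on this loop by using past evaluations to decide where to sample next, which matters when each evaluation (e.g., training a deep network) is expensive.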

Journal ArticleDOI
TL;DR: Special focus is given to Genetic Programming‐based EFSs by providing a taxonomy of the main architectures available, as well as by pointing out the gaps that still prevail in the literature.
Abstract: Studies in Evolutionary Fuzzy Systems (EFSs) began in the 90s and have experienced a fast development since then, with applications to areas such as pattern recognition, curve-fitting and regression, forecasting, and control. An EFS results from the combination of a Fuzzy Inference System (FIS) with an Evolutionary Algorithm (EA). This relationship can be established for multiple purposes: fine-tuning of the FIS's parameters, selection of fuzzy rules, learning a rule base or membership functions from scratch, and so forth. Each facet of this relationship creates a strand in the literature, such as membership function fine-tuning or fuzzy rule-based learning, and the purpose here is to outline some of what has been done in each aspect. Special focus is given to Genetic Programming-based EFSs by providing a taxonomy of the main architectures available, as well as by pointing out the gaps that still prevail in the literature. The concluding remarks address some further topics of current research and trends, such as interpretability analysis, multiobjective optimization, and synthesis of a FIS through Evolving methods.
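The "fine-tuning of the FIS's parameters" strand can be sketched with a minimal evolutionary loop: a (1+1) evolution strategy mutating the peak of one triangular membership function so that it fits target membership values. This is an illustrative toy, not one of the surveyed architectures; the data, targets, and mutation scale are invented.

```python
import random

def tri_mf(x, a, b, c):
    """Triangular membership function with feet a, c and peak b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def error(b, samples, targets, a=0.0, c=10.0):
    """Squared error between the tuned MF and target memberships."""
    return sum((tri_mf(x, a, b, c) - t) ** 2 for x, t in zip(samples, targets))

samples = [1, 3, 5, 7, 9]
targets = [tri_mf(x, 0.0, 6.0, 10.0) for x in samples]  # true peak at 6

rng = random.Random(42)
b = 2.0                                 # poor initial guess for the peak
best = error(b, samples, targets)
for _ in range(500):
    # Mutate the peak with Gaussian noise, clamped inside the feet.
    child = min(9.9, max(0.1, b + rng.gauss(0.0, 0.5)))
    e = error(child, samples, targets)
    if e <= best:                       # elitist: keep the child if no worse
        b, best = child, e
```

Real EFSs evolve many parameters at once (all membership functions plus the rule base), often with a population-based EA rather than a single-individual strategy.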

Journal ArticleDOI
TL;DR: This paper presents an efficient crack detection method for tunnel concrete structures based on digital image processing and deep learning; it introduces a faster region convolutional neural network for coarse crack region localization and classification, then deploys edge extraction to implement fine crack edge detection.
Abstract: Detecting cracks on the concrete surface is crucial for tunnel health monitoring and the maintenance of Chinese transport facilities, since it is closely related to structural health and reliability. Automated and efficient tunnel crack detection has recently attracted more research, particularly as the cheap availability of digital cameras makes the task easier. However, it is still challenging due to concrete blebs, stains, and uneven illumination over the concrete surface. This paper presents an efficient crack detection method for tunnel concrete structures based on digital image processing and deep learning. The three contributions of the paper are summarized as follows. First, we collect and annotate a tunnel crack dataset including three kinds of common cracks, which may benefit research in the field. Second, we propose a new coarse-to-fine crack detection method using improved preprocessing, coarse crack region localization and classification, and fine crack edge detection. Third, we introduce a faster region convolutional neural network for coarse crack region localization and classification, and then deploy edge extraction to implement fine crack edge detection, achieving high efficiency and high accuracy.
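The "fine crack edge detection" stage can be illustrated with a plain Sobel operator, a standard edge-extraction building block (the paper does not specify its exact edge extractor, so this is an assumed, simplified stand-in). The 5x5 grayscale patch below is synthetic: a dark vertical "crack" column on a bright concrete background.

```python
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def sobel_magnitude(img):
    """Gradient magnitude per interior pixel of a 2D grayscale image."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            gx = sum(SOBEL_X[a][b] * img[i + a - 1][j + b - 1]
                     for a in range(3) for b in range(3))
            gy = sum(SOBEL_Y[a][b] * img[i + a - 1][j + b - 1]
                     for a in range(3) for b in range(3))
            out[i][j] = (gx * gx + gy * gy) ** 0.5
    return out

# Synthetic patch: bright concrete (200) with a dark crack column (50).
patch = [[200, 200, 50, 200, 200] for _ in range(5)]
edges = sobel_magnitude(patch)
```

In the paper's coarse-to-fine pipeline, such an operator would run only inside the crack regions proposed by the faster R-CNN stage, which is what keeps the fine stage cheap.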

Journal ArticleDOI
TL;DR: The analysis of the reviewed works indicates the promising future of such methods, especially decomposition-based approaches; however, much still needs to be done to develop more robust, faster, and predictable evolutionary many-objective algorithms.
Abstract: Multiobjective evolutionary algorithms (MOEAs) effectively solve several complex optimization problems with two or three objectives. However, when they are applied to many-objective optimization, that is, when more than three criteria are simultaneously considered, the performance of most MOEAs is severely affected. Several alternatives have been reported that aim to reproduce, in higher-dimensional problems, the performance level that MOEAs achieve on problems with up to three objectives. This work briefly reviews the main search difficulties, visualization, evaluation of algorithms, and new procedures in many-objective optimization using evolutionary methods. Approaches for the development of evolutionary many-objective algorithms are classified into: (a) based on preference relations, (b) aggregation-based, (c) decomposition-based, (d) indicator-based, and (e) based on dimensionality reduction. The analysis of the reviewed works indicates the promising future of such methods, especially decomposition-based approaches; however, much still needs to be done to develop more robust, faster, and predictable evolutionary many-objective algorithms.
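The decomposition-based idea the review singles out can be shown in a few lines: each weight vector turns the many-objective problem into a scalar subproblem via the weighted Tchebycheff function, and the best candidate is kept per subproblem. The candidate objective vectors, weights, and ideal point below are invented for illustration (a two-objective minimization case, for readability).

```python
def tchebycheff(f, weights, ideal):
    """Weighted Tchebycheff scalarization of objective vector f
    relative to the ideal point z* (minimization)."""
    return max(w * abs(fi - zi) for w, fi, zi in zip(weights, f, ideal))

def best_per_subproblem(candidates, weight_vectors, ideal):
    """Index of the best candidate for each weight vector."""
    picks = []
    for w in weight_vectors:
        scores = [tchebycheff(f, w, ideal) for f in candidates]
        picks.append(scores.index(min(scores)))
    return picks

candidates = [(1.0, 5.0), (3.0, 3.0), (5.0, 1.0)]   # objective vectors
ideal = (0.0, 0.0)                                  # ideal point z*
weights = [(0.9, 0.1), (0.5, 0.5), (0.1, 0.9)]
picks = best_per_subproblem(candidates, weights, ideal)
```

Each weight vector selects a different trade-off solution, which is how decomposition-based algorithms such as MOEA/D spread the population across the Pareto front even with many objectives.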

Journal ArticleDOI
TL;DR: An overview of the two fields of sequential pattern mining and stream-based process discovery is provided, covering their commonalities and differences, highlighting the challenges of applying them, and presenting an outlook and several avenues for future work.
Abstract: Sequential pattern mining (SPM) is a well-studied theme in data mining, in which one aims to discover common sequences of item sets in a large corpus of temporal itemset data. Due to the sequential nature of data streams, supporting SPM in streaming environments is commonly studied in the area of data stream mining as well. On the other hand, stream-based process discovery (PD), originating from the field of process mining, focuses on learning process models on the basis of online event data. In particular, the main goal of the models discovered is to describe the underlying generating process in an end-to-end fashion. As both SPM and PD use data that are comparable in nature, that is, both involve time-stamped instances, one expects that techniques from the SPM domain are (partly) transferable to the PD domain. However, thus far, little work has been done in the intersection of the two fields. In this focus article, we therefore study the possible application of SPM techniques in the context of PD. We provide an overview of the two fields, covering their commonalities and differences, highlight the challenges of applying them, and present an outlook and several avenues for future work. This article is categorized under: Algorithmic Development > Spatial and Temporal Data Mining Fundamental Concepts of Data and Knowledge > Key Design Issues in Data Mining Fundamental Concepts of Data and Knowledge > Big Data Mining.
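The core primitive both fields share is computing the support of a candidate sequential pattern, i.e., the fraction of sequences that contain its items in order (gaps allowed). The sketch below shows this for single-item events; the event names and log are invented for illustration (full SPM works over itemsets per time step).

```python
def occurs_in(pattern, sequence):
    """True if the pattern's items appear in order in the sequence,
    with arbitrary gaps allowed between them."""
    it = iter(sequence)
    # 'item in it' advances the iterator past each match, enforcing order.
    return all(item in it for item in pattern)

def support(pattern, sequences):
    """Fraction of sequences in the corpus that contain the pattern."""
    return sum(occurs_in(pattern, s) for s in sequences) / len(sequences)

# Invented event log: one event sequence per case.
logs = [["login", "browse", "buy"],
        ["login", "buy"],
        ["browse", "login", "browse"],
        ["login", "browse", "logout"]]

s = support(["login", "browse"], logs)
```

An SPM algorithm enumerates candidate patterns and keeps those whose support exceeds a threshold, whereas process discovery goes further and assembles such ordering relations into one end-to-end process model.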