
Showing papers in "Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery in 2019"


Journal ArticleDOI
TL;DR: This article provides some necessary definitions to discriminate between explainability and causability as well as a use‐case of DL interpretation and of human explanation in histopathology, and argues that there is a need to go beyond explainable AI.
Abstract: Explainable artificial intelligence (AI) is attracting much interest in medicine. Technically, the problem of explainability is as old as AI itself, and classic AI represented comprehensible, retraceable approaches. However, their weakness was in dealing with uncertainties of the real world. Through the introduction of probabilistic learning, applications became increasingly successful, but increasingly opaque. Explainable AI deals with the implementation of transparency and traceability of statistical black-box machine learning methods, particularly deep learning (DL). We argue that there is a need to go beyond explainable AI. To reach a level of explainable medicine we need causability. In the same way that usability encompasses measurements for the quality of use, causability encompasses measurements for the quality of explanations. In this article, we provide some necessary definitions to discriminate between explainability and causability as well as a use-case of DL interpretation and of human explanation in histopathology. The main contribution of this article is the notion of causability, which is differentiated from explainability in that causability is a property of a person, while explainability is a property of a system. This article is categorized under: Fundamental Concepts of Data and Knowledge > Human Centricity and User Interaction.

723 citations


Journal ArticleDOI
TL;DR: A literature review on the parameters' influence on the prediction performance and on variable importance measures is provided, and the application of one of the most established tuning strategies, model‐based optimization (MBO), is demonstrated.
Abstract: The random forest algorithm (RF) has several hyperparameters that have to be set by the user, e.g., the number of observations drawn randomly for each tree and whether they are drawn with or without replacement, the number of variables drawn randomly for each split, the splitting rule, the minimum number of samples that a node must contain and the number of trees. In this paper, we first provide a literature review on the parameters' influence on the prediction performance and on variable importance measures. It is well known that in most cases RF works reasonably well with the default values of the hyperparameters specified in software packages. Nevertheless, tuning the hyperparameters can improve the performance of RF. In the second part of this paper, after a brief overview of tuning strategies we demonstrate the application of one of the most established tuning strategies, model-based optimization (MBO). To make it easier to use, we provide the tuneRanger R package that tunes RF with MBO automatically. In a benchmark study on several datasets, we compare the prediction performance and runtime of tuneRanger with other tuning implementations in R and RF with default hyperparameters.

559 citations
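The hyperparameters named in the abstract exist in most RF implementations; tuneRanger itself is an R package, so the following is only a rough Python sketch in which plain random search (via scikit-learn) stands in for model-based optimization, and the dataset is synthetic:

```python
# Hypothetical sketch: tuning the main RF hyperparameters from the abstract
# with random search (a simpler stand-in for model-based optimization).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

param_space = {
    "n_estimators": [100, 300, 500],    # number of trees
    "max_features": [2, 3, "sqrt"],     # variables drawn randomly per split
    "min_samples_leaf": [1, 5, 10],     # minimum number of samples per node
    "bootstrap": [True, False],         # observations drawn with/without replacement
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_space, n_iter=10, cv=3, random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

In practice, as the abstract notes, the defaults often work reasonably well, so tuning budgets like the small `n_iter` here are a trade-off between runtime and expected gain.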


Journal ArticleDOI
TL;DR: This work analyzes how frequent itemset mining has been addressed over the last decades, considering centralized systems as well as parallel (shared- or nonshared-memory) architectures; the solutions can be divided into exhaustive-search and nonexhaustive-search models.
Abstract: Frequent itemset mining (FIM) is an essential task within data analysis since it is responsible for extracting frequently occurring events, patterns, or items in data. Insights from such pattern analysis offer important benefits in decision-making processes. However, algorithmic solutions for mining such kinds of patterns are not straightforward since the computational complexity increases exponentially with the number of items in the data. This issue, together with the significant memory consumption present in the mining process, makes it necessary to propose extremely efficient solutions. Since the FIM problem was first described in the early 1990s, multiple solutions have been proposed, considering centralized systems as well as parallel (shared- or nonshared-memory) architectures. Solutions can also be divided into exhaustive-search and nonexhaustive-search models. Many such approaches are extensions of other solutions, and it is therefore necessary to analyze how this task has been considered during the last decades.

122 citations
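A minimal exhaustive-search sketch of the task the abstract describes (Apriori-style level-wise candidate generation; real miners add pruning and the memory optimizations the survey covers):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Minimal exhaustive-search frequent itemset miner (Apriori-style sketch)."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    frequent, k = {}, 1
    candidates = [frozenset([i]) for i in items]
    while candidates:
        # Count support of each candidate itemset over all transactions
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets into candidate (k+1)-itemsets
        candidates = list({a | b for a, b in combinations(level, 2)
                           if len(a | b) == k + 1})
        k += 1
    return frequent

txns = [frozenset(t) for t in (["a", "b", "c"], ["a", "b"], ["a", "c"],
                               ["b", "c"], ["a", "b", "c"])]
freq = apriori(txns, min_support=0.6)
print(len(freq))  # six frequent itemsets at 60% support
```

The exponential blow-up the abstract mentions is visible here: without the support threshold, the candidate space is the power set of the items.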


Journal ArticleDOI
TL;DR: This work presents a systematic overview of the current status of the Educational Text Mining field, answering three main research questions about the text mining techniques most used in educational environments, the most used educational resources, and the main applications or educational goals.
Abstract: The explosive growth of online education environments is generating a massive volume of data, especially in text format from forums, chats, social networks, assessments, essays, and other sources. This produces exciting challenges on how to mine text data in order to find useful knowledge for educational stakeholders. Despite the increasing number of educational applications of text mining published recently, we have not found any paper surveying them. Along these lines, this work presents a systematic overview of the current status of the Educational Text Mining field. Our final goal is to answer three main research questions: Which are the text mining techniques most used in educational environments? Which are the most used educational resources? And which are the main applications or educational goals? Finally, we outline the conclusions and the most interesting future trends.

98 citations


Journal ArticleDOI
TL;DR: A thorough experimental analysis in a series of big datasets is carried out that provides guidelines as to how to use the k‐nearest neighbor algorithm to obtain Smart/Quality Data for a high‐quality data mining process.
Abstract: The k-nearest neighbours algorithm is characterised as a simple yet effective data mining technique. The main drawback of this technique appears when massive amounts of data, likely to contain noise and imperfections, are involved, turning this algorithm into an imprecise and especially inefficient technique. These disadvantages have been the subject of research for many years, and among other approaches, data preprocessing techniques such as instance reduction or missing-values imputation have targeted these weaknesses. As a result, these issues have turned into strengths and the k-nearest neighbours rule has become a core algorithm to identify and correct imperfect data, removing noisy and redundant samples, or imputing missing values, transforming Big Data into Smart Data, which is data of sufficient quality to expect a good outcome from any data mining algorithm. The role of this smart data gleaning algorithm in a supervised learning context is investigated here. This includes a brief overview of Smart Data, current and future trends for the k-nearest neighbours algorithm in the Big Data context, and the existing data preprocessing techniques based on this algorithm. We present the emerging big data-ready versions of these algorithms and develop some new methods to cope with Big Data. We carry out a thorough experimental analysis on a series of big datasets that provides guidelines as to how to use the k-nearest neighbours algorithm to obtain Smart/Quality Data for a high-quality data mining process. Moreover, multiple Spark Packages have been developed, including all the Smart Data algorithms analysed.

89 citations
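One of the preprocessing roles described above, missing-value imputation with nearest neighbours, can be sketched with scikit-learn (a small in-memory illustration, not the Spark-scale versions the abstract refers to):

```python
# Sketch: k-nearest-neighbours imputation of a missing value.
# The missing entry is replaced by the mean of that feature over the
# k rows closest to the incomplete row (distances ignore missing entries).
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [2.0, np.nan],   # missing value to impute
              [3.0, 4.0],
              [8.0, 8.0]])
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled[1, 1])  # mean of the two nearest donors' second feature: 3.0
```

The same neighbourhood idea underlies the noise-filtering and instance-reduction techniques surveyed in the article: samples whose neighbours disagree with their label are flagged as noisy or redundant.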


Journal ArticleDOI
TL;DR: There have been various NN‐based approaches proposed for short‐term traffic state prediction that are surveyed in this article, where the existing NN models are classified and their application to this area is reviewed.
Abstract: Traffic state prediction is a key component in intelligent transport systems (ITS) and has attracted much attention over the last few decades. Advances in computational power and availability of a large amount of data have paved the way to employ advanced neural network (NN) models for ITS, including deep architectures. There have been various NN‐based approaches proposed for short‐term traffic state prediction that are surveyed in this article, where the existing NN models are classified and their application to this area is reviewed. An in‐depth discussion is provided to demonstrate how different types of NNs have been used for different aspects of short‐term traffic state prediction. Finally, possible further research directions are suggested for additional applications of NN models, especially using deep architectures, to address the dynamic nature in complex transportation networks.

73 citations


Journal ArticleDOI
TL;DR: In this paper, the authors present a broad range of computational techniques to improve applicability, run time, and memory management of automatic differentiation packages, including operation overloading, region based memory, and expression templates.
Abstract: Derivatives play a critical role in computational statistics, examples being Bayesian inference using Hamiltonian Monte Carlo sampling and the training of neural networks. Automatic differentiation is a powerful tool to automate the calculation of derivatives and is preferable to more traditional methods, especially when differentiating complex algorithms and mathematical functions. The implementation of automatic differentiation, however, requires some care to ensure efficiency. Modern differentiation packages deploy a broad range of computational techniques to improve applicability, run time, and memory management. Among these techniques are operation overloading, region-based memory, and expression templates. There also exist several mathematical techniques which can yield high performance gains when applied to complex algorithms. For example, semi-analytical derivatives can reduce by orders of magnitude the runtime required to numerically solve and differentiate an algebraic equation. Open problems include the extension of current packages to provide more specialized routines, and efficient methods to perform higher-order differentiation.

68 citations
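The operation-overloading technique the abstract names can be sketched with forward-mode dual numbers: each value carries its derivative alongside it, and the overloaded arithmetic propagates both. This is a toy illustration of the idea, not how production AD packages are engineered:

```python
class Dual:
    """Forward-mode automatic differentiation via dual numbers."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot   # value and its derivative
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__
    def __mul__(self, other):           # product rule
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)
    __rmul__ = __mul__

def derivative(f, x):
    """Evaluate df/dx by seeding the input's derivative with 1."""
    return f(Dual(x, 1.0)).dot

# d/dx (x^2 + 3x) at x = 2 is 2x + 3 = 7
print(derivative(lambda x: x * x + 3 * x, 2.0))
```

Reverse-mode AD, the variant used for neural-network training, instead records the computation and propagates derivatives backwards, which is where the memory-management techniques discussed in the article become critical.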


Journal ArticleDOI
TL;DR: A systematic review of the literature on smart city big data analytics, a technological and thematic analysis of the shortlisted literature, and a classification model that studies four aspects of research in this domain are presented.
Abstract: With the increasing role of ICT in enabling and supporting smart cities, the demand for big data analytics solutions is increasing. Various artificial intelligence, data mining, machine learning and statistical analysis-based solutions have been successfully applied in thematic domains like climate science, energy management, transport, air quality management and weather pattern analysis. In this paper, we present a systematic review of the literature on smart city big data analytics. We have searched a number of different repositories using specific keywords and followed a structured data mining methodology for selecting material for the review. We have also performed a technological and thematic analysis of the shortlisted literature, identified various data mining/machine learning techniques and presented the results. Based on this analysis we also present a classification model that studies four aspects of research in this domain. These include data models, computing models, security and privacy aspects and major market drivers in the smart cities domain. Moreover, we present a gap analysis and identify future directions for research. For the thematic analysis we identified the themes smart city governance, economy, environment, transport and energy. We present the major challenges in these themes, the major research work done in the field of data analytics to address these challenges and future research directions. This article is categorized under: Application Areas > Government and Public Sector Fundamental Concepts of Data and Knowledge > Big Data Mining.

66 citations


Journal ArticleDOI
TL;DR: This advanced review describes the historical profile of the shallow feature learning research and introduces the important developments of the deep learning models, and surveys the deep architectures with benefits from the optimization of their width and depth.
Abstract: Since Pearson developed principal component analysis (PCA) in 1901, feature learning (also called representation learning) has been studied for more than 100 years. During this period, many “shallow” feature learning methods were proposed based on various learning criteria and techniques, until the rise of deep learning research in recent years. In this advanced review, we describe the historical profile of shallow feature learning research and introduce the important developments of deep learning models. Particularly, we survey deep architectures that benefit from the optimization of their width and depth, as these models have achieved new records in many applications, such as image classification and object detection. Finally, several interesting directions of deep learning are presented and briefly discussed.

63 citations
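Since the review traces feature learning back to Pearson's PCA, a minimal sketch of PCA via SVD of the centered data matrix may help fix ideas (synthetic data, for illustration only):

```python
import numpy as np

def pca(X, k):
    """Project X onto its top-k principal components (PCA via SVD)."""
    Xc = X - X.mean(axis=0)                       # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                          # scores in the top-k subspace

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Z = pca(X, 2)
print(Z.shape)  # (100, 2)
```

The resulting score columns are uncorrelated by construction, which is the "learning criterion" (maximal variance under orthogonality) that later shallow and deep methods generalize.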


Journal ArticleDOI
TL;DR: In this review, several areas of cybersecurity where machine learning is used as a tool are discussed and a few glimpses of adversarial attacks on machine learning algorithms to manipulate training and test data of classifiers, to render such tools ineffective are provided.
Abstract: Machine learning technology has become mainstream in a large number of domains, and cybersecurity applications of machine learning techniques are plentiful. Examples include malware analysis, especially for zero-day malware detection, threat analysis, anomaly-based intrusion detection of prevalent attacks on critical infrastructures, and many others. Due to the ineffectiveness of signature-based methods in detecting zero-day attacks or even slight variants of known attacks, machine learning-based detection is being used by researchers in many cybersecurity products. In this review, we discuss several areas of cybersecurity where machine learning is used as a tool. We also provide a few glimpses of adversarial attacks on machine learning algorithms that manipulate training and test data of classifiers in order to render such tools ineffective.

62 citations


Journal ArticleDOI
TL;DR: The main objective of this study is to provide a comprehensive survey of error measures for evaluating the outcome of binary decision making applicable to many data‐driven fields.
Abstract: Binary decision making is a topic of great interest for many fields, including biomedical science, economics, management, politics, medicine, natural science and social science, and much effort has been spent for developing novel computational methods to address problems arising in the aforementioned fields. However, in order to evaluate the effectiveness of any prediction method for binary decision making, the choice of the most appropriate error measures is of paramount importance. Due to the variety of error measures available, the evaluation process of binary decision making can be a complex task. The main objective of this study is to provide a comprehensive survey of error measures for evaluating the outcome of binary decision making applicable to many data-driven fields. This article is categorized under: Fundamental Concepts of Data and Knowledge > Key Design Issues in Data Mining; Technologies > Prediction; Algorithmic Development > Statistics.
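As a small illustration of the survey's subject matter, here are a few widely used error measures derived from a 2x2 confusion matrix (this is not the survey's full catalogue, just a representative sketch):

```python
def binary_measures(tp, fp, fn, tn):
    """Common error measures derived from a 2x2 confusion matrix."""
    sens = tp / (tp + fn)                  # sensitivity / recall
    spec = tn / (tn + fp)                  # specificity
    prec = tp / (tp + fp)                  # precision
    f1 = 2 * prec * sens / (prec + sens)   # harmonic mean of precision/recall
    mcc_num = tp * tn - fp * fn            # Matthews correlation coefficient
    mcc_den = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    return {"sensitivity": sens, "specificity": spec,
            "precision": prec, "F1": f1, "MCC": mcc_num / mcc_den}

m = binary_measures(tp=40, fp=10, fn=5, tn=45)
print({k: round(v, 3) for k, v in m.items()})
```

The variety of measures is exactly the complexity the abstract points to: F1 ignores true negatives while MCC uses all four cells, so the two can rank the same classifiers differently on imbalanced data.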

Journal ArticleDOI
TL;DR: This work identifies key parametric attributes for assessing clustering algorithms, which in turn benefit existing work and pave the way for profound future research in this realm.
Abstract: Data mining is an essential task in most emerging computing technologies, as it reduces the complexity of datasets by providing better insight into them. It enables the analysis of vast and heterogeneous datasets and thus extracts substantial knowledge from the abundance of data through the pragmatic application of suitable algorithms. There are many algorithms in the literature for this purpose. Clustering is among the most widely used techniques for analyzing data within the purview of data mining, which motivated the authors to survey the existing literature on this topic rigorously. We have consequently identified various key parameters so that improvements become possible when selecting the clustering algorithm that best fits a specific problem domain. Because clustering, classification, and association rule mining are closely related and indispensable to data mining, we also cover the interrelations among these tasks, so that this work will be of help to researchers working in this field. The present study also examines the challenges associated with clustering algorithms for two- and high-dimensional databases. Over and above, this work identifies key parametric attributes for assessing clustering algorithms, which in turn benefit existing work and pave the way for profound future research in this realm.

Journal ArticleDOI
TL;DR: Applying a game-theoretic approach, robust learning techniques have been developed to specifically address adversarial attacks, and the preliminary results are promising.
Abstract: The field of machine learning is progressing at a faster pace than ever before. Many organizations leverage machine learning tools to extract useful information from a massive amount of data. In particular, machine learning finds its application in cybersecurity, which begins to enter the age of automation. However, machine learning applications in cybersecurity face unique challenges that other domains rarely do: attacks from active adversaries. Problems in areas such as intrusion detection, banking fraud detection, spam filtering, and malware detection have to face adversarial attacks that modify data so that malicious instances evade detection by the learning systems. The adversarial learning problem naturally resembles a game between the learning system and the adversary. In such a game, both players attempt to play their best strategies against each other while maximizing their own payoffs. To solve the game, each player searches for an optimal strategy against the opponent based on the prediction of the opponent's strategy choice. The problem becomes even more complicated in settings where the learning system may have to deal with many adversaries of unknown types. Applying a game-theoretic approach, robust learning techniques have been developed to specifically address adversarial attacks, and the preliminary results are promising. In this review, we summarize these results.

Journal ArticleDOI
TL;DR: This work reviews 13 subgroup identification methods and uses real‐world and simulated data to compare the performance of their publicly available software using seven criteria, showing that many methods fare poorly on at least one criterion.
Abstract: Natural heterogeneity in patient populations can make it very hard to develop treatments that benefit all patients. As a result, an important goal of precision medicine is identification of patient subgroups that respond to treatment at a much higher (or lower) rate than the population average. Despite there being many subgroup identification methods, there is no comprehensive comparative study of their statistical properties. We review 13 methods and use real‐world and simulated data to compare the performance of their publicly available software using seven criteria: (a) bias in selection of subgroup variables, (b) probability of false discovery, (c) probability of identifying correct predictive variables, (d) bias in estimates of subgroup treatment effects, (e) expected subgroup size, (f) expected true treatment effect of subgroups, and (g) subgroup stability. The results show that many methods fare poorly on at least one criterion.

Journal ArticleDOI
TL;DR: Comparison results indicate that SRmining, PMES, Ant‐ARM, and MDS‐H are the fastest heuristic ARM algorithms, and HSBO‐TS is the most complete one, while SRmining and ACS require only one database scan.
Abstract: Association rule mining (ARM) is a commonly encountered data mining method. There are many approaches to mining frequent rules and patterns from a database, and one among them is heuristics. Many heuristic approaches have been proposed but, to the best of our knowledge, there is no comprehensive literature review of such approaches, only limited attempts. This gap needs to be filled. This paper reviews heuristic approaches to ARM and points out their most significant strengths and weaknesses. We propose eight performance metrics, such as execution time, memory consumption, completeness, and interestingness; we compare the approaches against these performance metrics and discuss our findings. For instance, comparison results indicate that SRmining, PMES, Ant-ARM, and MDS-H are the fastest heuristic ARM algorithms. HSBO-TS is the most complete one, while SRmining and ACS require only one database scan. In addition, we propose a parameter, named GT-Rank, for ranking heuristic ARM approaches, and based on it, ARMGA, ASC, and Kua emerge as the best approaches. We also treat ARM algorithms and their characteristics as transactions and items in a transactional database, respectively, and generate association rules that indicate research trends in this area.
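To make the rule-generation step concrete, here is a minimal sketch that derives association rules from precomputed frequent-itemset supports using a confidence threshold (the heuristic search strategies that are the paper's actual focus are omitted; the example supports are made up):

```python
from itertools import combinations

def rules(frequent, min_conf):
    """Derive rules A -> B from a dict of frequent itemsets to supports."""
    out = []
    for itemset, supp in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for ante in combinations(sorted(itemset), r):
                ante = frozenset(ante)
                conf = supp / frequent[ante]   # conf(A->B) = supp(AB)/supp(A)
                if conf >= min_conf:
                    out.append((ante, itemset - ante, conf))
    return out

# Hypothetical supports: supp(a)=0.8, supp(b)=0.6, supp(ab)=0.6
freq = {frozenset("a"): 0.8, frozenset("b"): 0.6, frozenset("ab"): 0.6}
print(rules(freq, min_conf=0.9))  # only b -> a survives (confidence 1.0)
```

Metrics like the paper's "interestingness" refine this: confidence alone can flag rules whose consequent is simply common, which measures such as lift correct for.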

Journal ArticleDOI
TL;DR: In this review, more than 90 relevant research studies have been analyzed, describing the most important practical applications, terminological resources, tools, and open challenges of TM in medicine.
Abstract: Health care professionals produce abundant textual information in their daily clinical practice and this information is stored in many diverse sources and, generally, in textual form. The extraction of insights from all the gathered information, mainly unstructured and lacking normalization, is one of the major challenges in computational medicine. In this respect, text mining (TM) assembles different techniques to derive valuable insights from unstructured textual data so it has led to be especially relevant in medicine. The aim of this paper is therefore to provide an extensive review of existing techniques and resources to perform TM tasks in medicine. In this review, more than 90 relevant research studies have been analyzed, describing the most important practical applications, terminological resources, tools, and open challenges of TM in medicine.

Journal ArticleDOI
TL;DR: The sheer amount of data stemming from devices forming the IoT requires new data mining systems and techniques, which are discussed and categorized later in this paper.
Abstract: The Internet of Things (IoT) is the result of the convergence of sensing, computing, and networking technologies, allowing devices of varying sizes and computational capabilities (things) to intercommunicate. This communication can be achieved locally, enabling what is known as edge and fog computing, or through the well-established Internet infrastructure, exploiting the computational resources in the cloud. The IoT paradigm enables a new breed of applications in various areas including health care, energy management, and smart cities. This paper starts off by reviewing these applications and their potential benefits. Challenges facing the realization of such applications are then discussed. The sheer amount of data stemming from devices forming the IoT requires new data mining systems and techniques, which are discussed and categorized later in this paper. Finally, the paper concludes with future research directions.

Journal ArticleDOI
TL;DR: A good overview of the status quo in benchmarking in classification and nonlinear regression can be found in this article, where the authors present guidelines and best practices for benchmarking and discuss performance metrics for a sound statistical comparative analysis.
Abstract: The article presents an overview of the status quo in benchmarking in classification and nonlinear regression. It outlines guidelines and best practices for comparative analysis in machine learning, covering benchmarking principles, accuracy estimation, and model validation, and it discusses performance metrics for a sound statistical comparative analysis. It also provides references to established repositories and competitions and discusses the objectives and limitations of benchmarking. Benchmarking is key to progress in machine learning, as it allows an unprejudiced comparison among alternative methods.
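A minimal illustration of one benchmarking principle the article discusses, evaluating competing methods on identical resampling splits so their scores are directly comparable (the dataset and models here are chosen only for illustration, via scikit-learn):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
# Fixing the CV splitter's seed gives every method the same train/test folds.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
models = [
    ("logreg", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ("tree", DecisionTreeClassifier(random_state=0)),
]
for name, model in models:
    scores = cross_val_score(model, X, y, cv=cv)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Paired per-fold scores like these are what a sound statistical comparison (e.g., a paired test over folds) operates on, rather than a single accuracy number per method.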

Journal ArticleDOI
TL;DR: This paper first constructs an ethnical group face dataset including Chinese Uyghur, Tibetan, and Korean, and constructs three “T” regions in a face image for ethnical feature representation and proves them to be effective areas for ethnicity recognition.
Abstract: The salient facial feature discovery is one of the important research tasks in ethnical group face recognition. In this paper, we first construct an ethnical group face dataset including Chinese Uyghur, Tibetan, and Korean subjects. Then, we show that the effective sparse sensing approach to general face recognition no longer works for ethnical group facial recognition if features based on the whole face image are used. This is partially due to the fact that each ethnical group may have its own characteristics manifesting only in specific face regions. Therefore, we analyze the particularity of the three ethnical groups and aim to find common characterizations in some local regions for them. For this purpose, we first use the facial landmark detector STASM to find some important landmarks in a face image; then, we use the well-known data mining technique, the mRMR algorithm, to select the salient geometric length features based on all possible lines connected by any two landmarks. Second, based on these selected salient features, we construct three “T” regions in a face image for ethnical feature representation and prove them to be effective areas for ethnicity recognition. Finally, some extensive experiments are conducted and the results reveal that the proposed “T” regions with extracted features are quite effective for ethnical group facial recognition when the L2-norm is adopted using the sparse sensing approach. In comparison to face recognition, the proposed three “T” regions are evaluated on the Olivetti Research Laboratory (ORL) face dataset, and the results show that the “T” regions constructed for ethnicity recognition are not suitable for general face recognition.

Journal ArticleDOI
TL;DR: The state-of-the-art event-based vision algorithms are surveyed in this article and categorized into three major vision applications: object detection/recognition, object tracking, and localization and mapping.
Abstract: Regardless of the marvels brought by conventional frame-based cameras, they have significant drawbacks due to their redundancy in data and temporal latency. This causes problems in applications where low-latency transmission and high-speed processing are mandatory. Proceeding along this line of thought, the neurobiological principles of the biological retina have been adapted to accomplish data sparsity and high dynamic range at the pixel level. These bio-inspired neuromorphic vision sensors alleviate the serious bottleneck of data redundancy by responding to changes in illumination rather than to illumination itself. This paper briefly reviews one such representative of neuromorphic sensors, the activity-driven event-based vision sensor, which mimics human eyes. Spatio-temporal encoding of event data permits the incorporation of time correlation in addition to spatial correlation in vision processing, which enables more robustness. Hence, conventional vision algorithms have to be reformulated to adapt to this new generation of vision sensor data. This involves designing algorithms for sparse, asynchronous, and accurately timed information. Theories and new research have begun emerging recently in the domain of event-based vision, and compiling the vision research carried out in this sensor domain has become essential. Towards this, this paper reviews the state-of-the-art event-based vision algorithms by categorizing them into three major vision applications: object detection/recognition, object tracking, and localization and mapping. This article is categorized under: Technologies > Machine Learning

Journal ArticleDOI
TL;DR: This paper describes and evaluates the following popular Big Data processing tools: Drill, HAWQ, Hive, Impala, Presto, and Spark, and highlights the performance of each tool, according to different workloads and query types.
Abstract: FCT – Fundacao para a Ciencia e Tecnologia, Grant/Award Number: UID/CEC/00319/2013; COMPETE, Grant/Award Number: POCI01-0145-FEDER-007043

Journal ArticleDOI
TL;DR: In this article, the authors present approaches for model-based clustering and classification of functional data, and derive well-established statistical models along with efficient algorithmic tools to address problems regarding the clustering, missing information, and dynamical hidden structure.
Abstract: The problem of complex data analysis is a central topic of modern statistical science and learning systems and is becoming of broader interest with the increasing prevalence of high-dimensional data. The challenge is to develop statistical models and autonomous algorithms that are able to acquire knowledge from raw data for exploratory analysis, which can be achieved through clustering techniques, or to make predictions of future data via classification (i.e., discriminant analysis) techniques. Latent data models, including mixture model-based approaches, are among the most popular and successful approaches in both the unsupervised context (i.e., clustering) and the supervised one (i.e., classification or discrimination). Although traditionally tools of multivariate analysis, they are growing in popularity when considered in the framework of functional data analysis (FDA). FDA is the data analysis paradigm in which the individual data units are functions (e.g., curves, surfaces), rather than simple vectors. In many areas of application, the analyzed data are indeed often available in the form of discretized values of functions or curves (e.g., time series, waveforms) and surfaces (e.g., 2d-images, spatio-temporal data). This functional aspect of the data adds additional difficulties compared to the case of a classical multivariate (non-functional) data analysis. We review and present approaches for model-based clustering and classification of functional data. We derive well-established statistical models along with efficient algorithmic tools to address problems regarding the clustering and the classification of these high-dimensional data, including their heterogeneity, missing information, and dynamical hidden structure. The presented models and algorithms are illustrated on real-world functional data analysis problems from several application areas.
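A toy sketch of model-based clustering of discretized curves with a Gaussian mixture (scikit-learn's GaussianMixture on curves sampled on a common grid; the article's functional models are richer, this only illustrates the mixture idea on simulated data):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Two groups of noisy curves observed on a common grid of 20 time points,
# i.e., the discretized FDA setting described in the abstract.
t = np.linspace(0, 1, 20)
curves = np.vstack([
    np.sin(2 * np.pi * t) + 0.1 * rng.normal(size=(30, 20)),
    np.cos(2 * np.pi * t) + 0.1 * rng.normal(size=(30, 20)),
])
# Model-based clustering: fit a 2-component mixture and read off the
# maximum-a-posteriori component of each curve.
gmm = GaussianMixture(n_components=2, covariance_type="diag", random_state=0)
labels = gmm.fit_predict(curves)
print(labels[:5], labels[-5:])
```

Treating each discretized curve as a plain vector, as here, is exactly the multivariate shortcut the abstract warns about; functional approaches instead model the curves through basis expansions or latent dynamics.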

Journal ArticleDOI
TL;DR: This survey studies EFA applied to data mining, focusing on the problem of establishing the optimal number of factors to be retained; the main focus is on the most frequently applied factor selection methods, namely the Kaiser Criterion, Cattell's Scree test, and Monte Carlo Parallel Analysis.
Abstract: In many types of research, including studies in the agricultural and plant sciences, large quantities of data are frequently obtained that must be analyzed using different data mining techniques. Sometimes data mining involves the application of different methods of statistical data analysis. Exploratory Factor Analysis (EFA) is frequently used as a technique for data reduction and structure detection in data mining. In our survey, we study EFA applied to data mining, focusing on the problem of establishing the optimal number of factors to be retained. The number of factors to retain is the most important decision to be taken after factor extraction in EFA. Many researchers have discussed the criteria for choosing the optimal number of factors. Mistakes in factor extraction may consist of extracting too few or too many factors, and an inappropriate number of factors may lead to erroneous conclusions. A comprehensive review of the state-of-the-art related to this subject was made. The main focus was on the most frequently applied factor selection methods, namely the Kaiser Criterion, Cattell's Scree test, and Monte Carlo Parallel Analysis. We have highlighted the importance, depending on the specifics of the research, of analyzing the total cumulative variance explained by the selected optimal number of extracted factors; the extracted factors should explain at least a minimum threshold of cumulative variance. The ExtrOptFact algorithm presents the steps that must be performed in EFA for the selection of the optimal number of factors. For validation purposes, a case study was presented, performed on data obtained in an experimental study that we made on the Brassica napus plant. Applying the ExtrOptFact algorithm to Principal Component Analysis led to the selection of three components, called Qualitative, Generative, and Vegetative, which together explained 92% of the total cumulative variance.
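The two retention rules the survey emphasizes can be stated very compactly. In the hedged sketch below, the observed eigenvalues are assumed to come from an EFA of a correlation matrix computed elsewhere, and the simulated eigenvalue means stand in for the Monte Carlo replicates that parallel analysis would generate; all numbers are illustrative.

```python
def kaiser(eigenvalues):
    """Kaiser criterion: retain factors with eigenvalue > 1."""
    return sum(1 for ev in eigenvalues if ev > 1.0)

def parallel_analysis(observed, simulated_mean):
    """Retain factors whose observed eigenvalue exceeds the mean
    eigenvalue of random data at the same rank."""
    count = 0
    for obs, sim in zip(observed, simulated_mean):
        if obs > sim:
            count += 1
        else:
            break  # stop at the first factor that fails the comparison
    return count

observed = [3.2, 1.9, 1.1, 0.6, 0.2]        # illustrative eigenvalues
random_mean = [1.4, 1.2, 1.15, 0.95, 0.8]   # illustrative Monte Carlo means

k_kaiser = kaiser(observed)
k_pa = parallel_analysis(observed, random_mean)
```

With these illustrative numbers the Kaiser criterion retains three factors while parallel analysis retains only two, showing why the two rules can disagree and why the survey compares them.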

Journal ArticleDOI
TL;DR: This study proposes an objective risk measurement method for the lending process of SME commercial corporate customers and performs a data mining classification task using current customer data from a bank's credit evaluation process.
Abstract: The constant need to assess loans makes risk evaluation a very important problem for the banking sector. A crucial function of banks is to fund households and companies from various industries in the economy, and a bank takes on risk as soon as a loan is given to an entity. Currently, banks employ sector- and experience-based methods of analysis to estimate the risks to be taken. For the credit process, there exist a large number of studies in the literature on scoring individual clients, but there are very few studies on scoring small and medium enterprise (SME) commercial corporate customers. In this study, we propose an objective risk measurement method for the lending process of SME commercial corporate customers and perform a data mining classification task using current customer data from a bank's credit evaluation process. For this purpose, we first create a risk measure by looking into the risks identified for existing customers by the analysts of a bank. These scores are used as the target variable in the classification process. Then, we extract rules for estimating these scores using the Weka software. We used six different algorithms and compared the results in terms of test accuracy, number of rules, recall, precision, and the Kappa statistic. Our approach obtained high accuracy rates on real-life data. As a result, we showed that an objective evaluation strategy can be used in the lending process for SME commercial corporate customers in the banking system using data mining.
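The evaluation measures the study compares (test accuracy, recall, precision, and the Kappa statistic) can all be derived from a confusion matrix. The sketch below computes them for a binary case; the matrix counts are made up for illustration and are not from the study's data.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, recall, precision, and Cohen's kappa from a
    binary confusion matrix (tp/fp/fn/tn counts)."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    # Cohen's kappa: agreement beyond what chance alone would give.
    p_obs = accuracy
    p_yes = ((tp + fn) / n) * ((tp + fp) / n)   # chance agreement on "yes"
    p_no = ((fp + tn) / n) * ((fn + tn) / n)    # chance agreement on "no"
    p_exp = p_yes + p_no
    kappa = (p_obs - p_exp) / (1 - p_exp)
    return accuracy, recall, precision, kappa

# Illustrative confusion matrix: 40 true positives, 10 false positives,
# 5 false negatives, 45 true negatives.
acc, rec, prec, kap = metrics(tp=40, fp=10, fn=5, tn=45)
```

Kappa is often preferred over raw accuracy for credit-risk classes because it discounts agreement that would occur by chance under the class distribution.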

Journal ArticleDOI
TL;DR: Analysis of tweets during a service disruption of a leading Australian organization, used as a case study, found that sarcastic expressions were more frequent during the service disruption than on regular days, and that negative sarcastic tweets attracted significantly higher social media responses than literal negative expressions.
Abstract: Sarcasm in verbal and nonverbal communication is known to attract higher attention and create deeper influence than other negative responses. Many people are adept at including sarcasm in written communication, and sarcastic comments thus have the potential to stimulate the virality of social media content. Although diverse computational approaches have been used to detect sarcasm in social media, the use of text mining to explore the influential role of sarcasm in spreading negative content is limited. Using tweets during a service disruption of a leading Australian organization as a case study, we explore this phenomenon using a text mining framework with a combination of statistical modeling and natural language processing (NLP) techniques. Our work targets two main outcomes: the quantification of the influence of sarcasm and the exploration of the change in topical relationships in the conversations over time. We found that sarcastic expressions were more frequent during the service disruption than on regular days, and that negative sarcastic tweets attracted significantly higher social media responses than literal negative expressions. The content analysis showed that consumers initially complaining sarcastically about the outage tended to eventually widen the negative sarcasm, in a cascading effect, towards the organization's internal issues and strategies. Organizations could utilize such insights to enable proactive decision-making during crisis situations. Moreover, detailed exploration of these impacts would elevate current text mining applications toward a better understanding of the impact of sarcasm expressed by stakeholders in a social media environment, which can significantly affect the reputation and goodwill of an organization.
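The quantification step the abstract describes boils down to comparing the response volume of tweets flagged as sarcastic against literal negative tweets. The sketch below is illustrative only: the labels and response counts are invented, and in the actual study the `sarcastic` flag would come from an NLP sarcasm classifier and the comparison would be backed by a statistical test rather than a raw ratio.

```python
def mean(xs):
    return sum(xs) / len(xs)

# Invented mini-corpus: each record is a tweet's sarcasm label and its
# total social media responses (e.g., retweets plus replies).
tweets = [
    {"sarcastic": True,  "responses": 34},
    {"sarcastic": True,  "responses": 21},
    {"sarcastic": False, "responses": 5},
    {"sarcastic": False, "responses": 9},
    {"sarcastic": False, "responses": 4},
]

sarcastic = [t["responses"] for t in tweets if t["sarcastic"]]
literal = [t["responses"] for t in tweets if not t["sarcastic"]]
ratio = mean(sarcastic) / mean(literal)  # > 1: sarcasm draws more responses
```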

Journal ArticleDOI
TL;DR: A new research topic is introduced—transformative knowledge discovery—that provides a research ground to study and develop smart machine learning models and algorithms that are automatic, adaptive, and cognitive to address big data analytics problems and challenges.
Abstract: Big data analytics provides an interdisciplinary framework that is essential to support the current trend for solving real-world problems collaboratively. The progression of the big data analytics framework must be clearly understood so that novel approaches can be developed to advance this state-of-the-art discipline. Ignoring the progression of this fast-growing discipline may lead to duplicated research and wasted effort. Its main companion field, machine learning, helps solve many big data analytics problems; therefore, it is also important to understand the progression of machine learning in the big data analytics framework. One of the current research efforts in big data analytics is the integration of deep learning and Bayesian optimization, which can help the automatic initialization and optimization of the hyperparameters of deep learning and enhance the implementation of iterative algorithms in software. The hyperparameters include the weights used in deep learning and the number of clusters in Bayesian mixture models that characterize data heterogeneity. Big data analytics research also requires computer systems and software that are capable of storing, retrieving, processing, and analyzing big data that are generally large, complex, heterogeneous, unstructured, unpredictable, and exposed to scalability problems. Therefore, it is appropriate to introduce a new research topic—transformative knowledge discovery—that provides a research ground to study and develop smart machine learning models and algorithms that are automatic, adaptive, and cognitive to address big data analytics problems and challenges. The new research domain will also create research opportunities to work in this interdisciplinary research space and develop solutions to support research in other disciplines that may not have expertise in big data analytics.
For example, research such as the detection and characterization of retinal diseases in the medical sciences, or the classification of highly interacting species in the environmental sciences, can benefit from the knowledge and expertise in big data analytics.
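The hyperparameter-optimization idea mentioned in the abstract can be illustrated without the machinery of a full Bayesian optimizer. The sketch below deliberately substitutes random search, a common surrogate-free baseline, for Bayesian optimization (which would add a probabilistic surrogate model and an acquisition function on top of this loop); the toy "validation loss" and its optimum at 0.1 are invented for illustration.

```python
import random

def validation_loss(lr):
    """Toy, stand-in validation loss with a minimum at lr = 0.1."""
    return (lr - 0.1) ** 2 + 0.01

def random_search(n_trials, seed=0):
    """Keep the best of n_trials uniformly sampled learning rates."""
    rng = random.Random(seed)
    best_lr, best_loss = None, float("inf")
    for _ in range(n_trials):
        lr = rng.uniform(0.0, 1.0)      # sample a candidate hyperparameter
        loss = validation_loss(lr)
        if loss < best_loss:
            best_lr, best_loss = lr, loss
    return best_lr, best_loss

best_lr, best_loss = random_search(200)
```

Bayesian optimization improves on this loop by using past evaluations to decide where to sample next, which matters when each evaluation (e.g., training a deep network) is expensive.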

Journal ArticleDOI
TL;DR: Special focus is given to Genetic Programming‐based EFSs by providing a taxonomy of the main architectures available, as well as by pointing out the gaps that still prevail in the literature.
Abstract: Studies in Evolutionary Fuzzy Systems (EFSs) began in the 90s and have experienced a fast development since then, with applications to areas such as pattern recognition, curve-fitting and regression, forecasting, and control. An EFS results from the combination of a Fuzzy Inference System (FIS) with an Evolutionary Algorithm (EA). This relationship can be established for multiple purposes: fine-tuning of the FIS's parameters, selection of fuzzy rules, learning a rule base or membership functions from scratch, and so forth. Each facet of this relationship creates a strand in the literature, such as membership function fine-tuning or fuzzy rule-based learning, and the purpose here is to outline some of what has been done in each aspect. Special focus is given to Genetic Programming-based EFSs by providing a taxonomy of the main architectures available, as well as by pointing out the gaps that still prevail in the literature. The concluding remarks address some further topics of current research and trends, such as interpretability analysis, multiobjective optimization, and synthesis of a FIS through Evolving methods.
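The "fine-tuning of the FIS's parameters" strand can be sketched with a minimal evolutionary loop: a (1+1) evolution strategy mutating the peak of one triangular membership function so that it fits target membership values. This is an illustrative toy, not one of the surveyed architectures; the data, targets, and mutation scale are invented.

```python
import random

def tri_mf(x, a, b, c):
    """Triangular membership function with feet a, c and peak b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def error(b, samples, targets, a=0.0, c=10.0):
    """Squared error between the tuned MF and target memberships."""
    return sum((tri_mf(x, a, b, c) - t) ** 2 for x, t in zip(samples, targets))

samples = [1, 3, 5, 7, 9]
targets = [tri_mf(x, 0.0, 6.0, 10.0) for x in samples]  # true peak at 6

rng = random.Random(42)
b = 2.0                                 # poor initial guess for the peak
best = error(b, samples, targets)
for _ in range(500):
    # Mutate the peak with Gaussian noise, clamped inside the feet.
    child = min(9.9, max(0.1, b + rng.gauss(0.0, 0.5)))
    e = error(child, samples, targets)
    if e <= best:                       # elitist: keep the child if no worse
        b, best = child, e
```

Real EFSs evolve many parameters at once (all membership functions plus the rule base), often with a population-based EA rather than a single-individual strategy.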

Journal ArticleDOI
TL;DR: This paper presents an efficient crack detection method for tunnel concrete structures based on digital image processing and deep learning; it introduces a faster region convolutional neural network for coarse crack region localization and classification, then deploys edge extraction to implement fine crack edge detection.
Abstract: Detecting cracks on the concrete surface is crucial for tunnel health monitoring and the maintenance of Chinese transport facilities, since it is closely related to structural health and reliability. Automated and efficient tunnel crack detection has recently attracted more research, particularly as the cheap availability of digital cameras makes the task easier. However, it is still challenging due to concrete blebs, stains, and uneven illumination over the concrete surface. This paper presents an efficient crack detection method for tunnel concrete structures based on digital image processing and deep learning. The three contributions of the paper are summarized as follows. First, we collect and annotate a tunnel crack dataset including three kinds of common cracks, which may benefit research in the field. Second, we propose a new coarse-to-fine crack detection method using improved preprocessing, coarse crack region localization and classification, and fine crack edge detection. Third, we introduce a faster region convolutional neural network for coarse crack region localization and classification, and then deploy edge extraction to implement fine crack edge detection, achieving high efficiency and high accuracy.
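The "fine crack edge detection" stage can be illustrated with a plain Sobel operator, a standard edge-extraction building block (the paper does not specify its exact edge extractor, so this is an assumed, simplified stand-in). The 5x5 grayscale patch below is synthetic: a dark vertical "crack" column on a bright concrete background.

```python
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def sobel_magnitude(img):
    """Gradient magnitude per interior pixel of a 2D grayscale image."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            gx = sum(SOBEL_X[a][b] * img[i + a - 1][j + b - 1]
                     for a in range(3) for b in range(3))
            gy = sum(SOBEL_Y[a][b] * img[i + a - 1][j + b - 1]
                     for a in range(3) for b in range(3))
            out[i][j] = (gx * gx + gy * gy) ** 0.5
    return out

# Synthetic patch: bright concrete (200) with a dark crack column (50).
patch = [[200, 200, 50, 200, 200] for _ in range(5)]
edges = sobel_magnitude(patch)
```

In the paper's coarse-to-fine pipeline, such an operator would run only inside the crack regions proposed by the faster R-CNN stage, which is what keeps the fine stage cheap.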

Journal ArticleDOI
TL;DR: The analysis of the reviewed works indicates the promising future of such methods, especially decomposition-based approaches; however, much still needs to be done to develop more robust, faster, and predictable evolutionary many-objective algorithms.
Abstract: Multiobjective evolutionary algorithms (MOEAs) effectively solve several complex optimization problems with two or three objectives. However, when they are applied to many-objective optimization, that is, when more than three criteria are simultaneously considered, the performance of most MOEAs is severely affected. Several alternatives have been reported that aim to reproduce, in higher-dimensional problems, the performance level that MOEAs achieve on problems with up to three objectives. This work briefly reviews the main search difficulties, visualization, evaluation of algorithms, and new procedures in many-objective optimization using evolutionary methods. Approaches for the development of evolutionary many-objective algorithms are classified into: (a) based on preference relations, (b) aggregation-based, (c) decomposition-based, (d) indicator-based, and (e) based on dimensionality reduction. The analysis of the reviewed works indicates the promising future of such methods, especially decomposition-based approaches; however, much still needs to be done to develop more robust, faster, and predictable evolutionary many-objective algorithms.
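The decomposition-based idea the review singles out can be shown in a few lines: each weight vector turns the many-objective problem into a scalar subproblem via the weighted Tchebycheff function, and the best candidate is kept per subproblem. The candidate objective vectors, weights, and ideal point below are invented for illustration (a two-objective minimization case, for readability).

```python
def tchebycheff(f, weights, ideal):
    """Weighted Tchebycheff scalarization of objective vector f
    relative to the ideal point z* (minimization)."""
    return max(w * abs(fi - zi) for w, fi, zi in zip(weights, f, ideal))

def best_per_subproblem(candidates, weight_vectors, ideal):
    """Index of the best candidate for each weight vector."""
    picks = []
    for w in weight_vectors:
        scores = [tchebycheff(f, w, ideal) for f in candidates]
        picks.append(scores.index(min(scores)))
    return picks

candidates = [(1.0, 5.0), (3.0, 3.0), (5.0, 1.0)]   # objective vectors
ideal = (0.0, 0.0)                                  # ideal point z*
weights = [(0.9, 0.1), (0.5, 0.5), (0.1, 0.9)]
picks = best_per_subproblem(candidates, weights, ideal)
```

Each weight vector selects a different trade-off solution, which is how decomposition-based algorithms such as MOEA/D spread the population across the Pareto front even with many objectives.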

Journal ArticleDOI
TL;DR: An overview of the two fields of sequential pattern mining and stream-based process discovery is provided, covering their commonalities and differences, highlighting the challenges of applying them, and presenting an outlook and several avenues for future work.
Abstract: Sequential pattern mining (SPM) is a well-studied theme in data mining, in which one aims to discover common sequences of item sets in a large corpus of temporal itemset data. Due to the sequential nature of data streams, supporting SPM in streaming environments is commonly studied in the area of data stream mining as well. On the other hand, stream-based process discovery (PD), originating from the field of process mining, focuses on learning process models on the basis of online event data. In particular, the main goal of the models discovered is to describe the underlying generating process in an end-to-end fashion. As both SPM and PD use data that are comparable in nature, that is, both involve time-stamped instances, one expects that techniques from the SPM domain are (partly) transferable to the PD domain. However, thus far, little work has been done in the intersection of the two fields. In this focus article, we therefore study the possible application of SPM techniques in the context of PD. We provide an overview of the two fields, covering their commonalities and differences, highlight the challenges of applying them, and present an outlook and several avenues for future work. This article is categorized under: Algorithmic Development > Spatial and Temporal Data Mining Fundamental Concepts of Data and Knowledge > Key Design Issues in Data Mining Fundamental Concepts of Data and Knowledge > Big Data Mining.
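The core primitive both fields share is computing the support of a candidate sequential pattern, i.e., the fraction of sequences that contain its items in order (gaps allowed). The sketch below shows this for single-item events; the event names and log are invented for illustration (full SPM works over itemsets per time step).

```python
def occurs_in(pattern, sequence):
    """True if the pattern's items appear in order in the sequence,
    with arbitrary gaps allowed between them."""
    it = iter(sequence)
    # 'item in it' advances the iterator past each match, enforcing order.
    return all(item in it for item in pattern)

def support(pattern, sequences):
    """Fraction of sequences in the corpus that contain the pattern."""
    return sum(occurs_in(pattern, s) for s in sequences) / len(sequences)

# Invented event log: one event sequence per case.
logs = [["login", "browse", "buy"],
        ["login", "buy"],
        ["browse", "login", "browse"],
        ["login", "browse", "logout"]]

s = support(["login", "browse"], logs)
```

An SPM algorithm enumerates candidate patterns and keeps those whose support exceeds a threshold, whereas process discovery goes further and assembles such ordering relations into one end-to-end process model.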