
Showing papers in "Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery in 2021"


Journal ArticleDOI
TL;DR: A historical perspective of explainability in AI is presented and criteria for explanations are proposed that are believed to play a crucial role in the development of human‐understandable explainable systems.
Abstract: Explainability in Artificial Intelligence (AI) has been revived as a topic of active research by the need to convey safety and trust to users in the “how” and “why” of automated decision‐making in different applications such as autonomous driving, medical diagnosis, or banking and finance. While explainability in AI has recently received significant attention, the origins of this line of work go back several decades to when AI systems were mainly developed as (knowledge‐based) expert systems. Since then, the definition, understanding, and implementation of explainability have been picked up in several lines of research work, namely, expert systems, machine learning, recommender systems, and approaches to neural‐symbolic learning and reasoning, mostly during different periods of AI history. In this article, we present a historical perspective of Explainable Artificial Intelligence. We discuss how explainability was mainly conceived in the past, how it is understood in the present, and how it might be understood in the future. We conclude the article by proposing criteria for explanations that we believe will play a crucial role in the development of human‐understandable explainable systems.

118 citations


Journal ArticleDOI
TL;DR: A review of the state-of-the-art in relation to explainability of artificial intelligence in the context of recent advances in machine learning and deep learning can be found in this paper.
Abstract: This paper provides a brief analytical review of the current state-of-the-art in relation to the explainability of artificial intelligence in the context of recent advances in machine learning and deep learning. The paper starts with a brief historical introduction and a taxonomy, and formulates the main challenges of explainability, building on the four principles of explainability recently formulated by the National Institute of Standards and Technology (NIST). Recently published methods related to the topic are then critically reviewed and analyzed. Finally, future directions for research are suggested.

82 citations


Journal ArticleDOI
TL;DR: This work presents a survey of recent developments in analyzing multimodal sentiments (involving text, audio, and video/image) in human–machine interaction, together with the challenges involved in analyzing them.
Abstract: The analysis of sentiments is essential in identifying and classifying opinions regarding a source material, that is, a product or service. The analysis of these sentiments finds a variety of applications, such as product reviews, opinion polls, movie reviews on YouTube, news video analysis, and health care applications including stress and depression analysis. The traditional, text‐based approach to sentiment analysis involves the collection of large textual data and different algorithms to extract the sentiment information from it. Multimodal sentiment analysis, in contrast, provides methods to carry out opinion analysis based on the combination of video, audio, and text, which goes well beyond conventional text‐based sentiment analysis in understanding human behaviors. The remarkable increase in the use of social media provides a large collection of multimodal data that reflects users' sentiments on certain aspects. This multimodal sentiment analysis approach helps in classifying the polarity (positive, negative, or neutral) of individual sentiments. Our work aims to present a survey of recent developments in analyzing multimodal sentiments (involving text, audio, and video/image) in human–machine interaction and the challenges involved in analyzing them. A detailed survey of sentiment datasets, feature extraction algorithms, data fusion methods, and the efficiency of different classification techniques is presented in this work.

47 citations
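To make the data fusion step concrete, here is a minimal sketch of late (decision-level) fusion, one common strategy for combining text, audio, and video predictions in surveys of this kind; the per-modality scores and weights below are invented for illustration and are not taken from the paper.

```python
# Hypothetical per-modality polarity scores in [-1, 1], e.g. produced by
# three independently trained classifiers (text, audio, video).
modality_scores = {"text": 0.62, "audio": -0.10, "video": 0.35}

# Late (decision-level) fusion: weight each modality's decision by its
# assumed reliability (the weights here are purely illustrative).
weights = {"text": 0.5, "audio": 0.2, "video": 0.3}
fused = sum(weights[m] * score for m, score in modality_scores.items())

# Map the fused score to a polarity label, with a small neutral band.
if fused > 0.1:
    label = "positive"
elif fused < -0.1:
    label = "negative"
else:
    label = "neutral"

print(f"fused score = {fused:+.2f} -> {label}")
```

Early (feature-level) fusion, by contrast, would concatenate per-modality feature vectors before a single classifier; both families appear in the fusion-method comparisons such surveys cover.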



Journal ArticleDOI
TL;DR: This article explores each stage of the life cycle of biodiversity data, discussing its methodologies, tools, and challenges.
Abstract: The unprecedented size of the human population, along with its associated economic activities, has an ever‐increasing impact on global environments. Across the world, countries are concerned about growing resource consumption and the capacity of ecosystems to provide resources. To conserve biodiversity effectively, it is essential to make indicators and knowledge openly available to decision‐makers in ways they can effectively use. The development and deployment of tools and techniques to generate these indicators require access to trustworthy data from biological collections, field surveys and automated sensors, molecular data, and historic academic literature. The transformation of these raw data into synthesized information that is fit for use requires going through many refinement steps. The methodologies and techniques applied to manage and analyze these data constitute an area usually called biodiversity informatics. Biodiversity data follow a life cycle consisting of planning, collection, certification, description, preservation, discovery, integration, and analysis. Researchers, whether producers or consumers of biodiversity data, will likely perform activities related to at least one of these steps. This article explores each stage of the life cycle of biodiversity data, discussing its methodologies, tools, and challenges.

29 citations


Journal ArticleDOI
TL;DR: The authors provide a comprehensive overview of different models proposed for the QA task, covering both the traditional information retrieval perspective and the more recent deep neural network perspective; deep learning approaches are the main focus of this paper.
Abstract: Text-based Question Answering (QA) is a challenging task which aims at finding short, concrete answers to users' questions. This line of research has been widely studied with information retrieval techniques and has received increasing attention in recent years through deep neural network approaches. Deep learning approaches, which are the main focus of this paper, provide a powerful technique for learning multiple layers of representations and the interaction between questions and texts. In this paper, we provide a comprehensive overview of different models proposed for the QA task, covering both the traditional information retrieval perspective and the more recent deep neural network perspective. We also introduce well-known datasets for the task and present available results from the literature to enable a comparison between different techniques.

25 citations
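To illustrate the retrieval-style scoring that underlies both perspectives, here is a minimal sketch that ranks candidate answers by similarity to a question. TF-IDF stands in for the learned representations a neural QA model would use, and the question and candidate texts are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

question = "who wrote the origin of species"
candidates = [
    "On the Origin of Species was written by Charles Darwin in 1859.",
    "The Eiffel Tower was completed in 1889.",
    "Darwin also wrote The Descent of Man.",
]

# Represent question and candidates in a shared vector space; a neural QA
# model would replace TF-IDF with learned (e.g., transformer) encodings.
vectorizer = TfidfVectorizer().fit(candidates + [question])
q_vec = vectorizer.transform([question])
c_vecs = vectorizer.transform(candidates)

# Rank candidate answers by similarity to the question.
scores = cosine_similarity(q_vec, c_vecs).ravel()
for score, text in sorted(zip(scores, candidates), reverse=True):
    print(f"{score:.3f}  {text}")
```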



Journal ArticleDOI
TL;DR: The current trends, techniques, and methods used in the privacy‐preserving data mining field are identified in order to build a clear and concise classification of PPDM methods and techniques, identify new methods and techniques not included in previous classifications, and emphasize future research directions.
Abstract: In modern times, the amount of data and information is increasing, along with its accessibility and availability, due to the Internet and social media. To search this vast data set and to discover unknown useful data patterns and predictions, data mining is used. Data mining allows unrelated data to be connected in a meaningful way, analyzed, and represented in the form of useful patterns and predictions of future behavior. The process of data mining can, however, potentially violate sensitive and personal data: individual privacy is under attack if some of the information leaks and reveals the identity of a person whose personal data were used in the data mining process. Many privacy‐preserving data mining (PPDM) techniques and methods aim to preserve privacy and sensitive data while providing accurate data mining results at the same time, incorporating different approaches that protect data in the process of data mining. The methodology used in this article is a systematic literature review and bibliometric analysis. This article identifies the current trends, techniques, and methods used in the privacy‐preserving data mining field in order to build a clear and concise classification of PPDM methods and techniques, identify new methods and techniques not included in previous classifications, and emphasize future research directions.

18 citations
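As one concrete instance of the perturbation family of PPDM techniques (an editorial illustration, not a method taken from the article), here is a minimal sketch of randomized response: individual answers are noised and hence deniable, yet the aggregate statistic remains estimable.

```python
import random

def randomized_response(truth: bool, p: float = 0.75) -> bool:
    """Report the true bit with probability p, else a uniformly random bit.

    A classic input-perturbation technique: each individual answer is
    deniable, but the population proportion can still be estimated.
    """
    if random.random() < p:
        return truth
    return random.random() < 0.5

# Simulate a survey of a sensitive yes/no attribute with a true rate of 30%.
random.seed(0)
true_rate, n, p = 0.30, 100_000, 0.75
reports = [randomized_response(random.random() < true_rate, p) for _ in range(n)]

# Unbias the observed rate: observed = p * true + (1 - p) * 0.5.
observed = sum(reports) / n
estimate = (observed - (1 - p) * 0.5) / p
print(f"observed={observed:.3f}  debiased estimate={estimate:.3f}")
```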


Journal ArticleDOI
TL;DR: This work analyzes studies investigating the scholarly data generated via academic technologies such as scholarly networks and digital libraries for building scalable approaches to retrieving, recommending, and analyzing scholarly content, classifying them into different applications based on literature features and highlighting the machine learning techniques used for this purpose.
Abstract: During the last few decades, the widespread growth of scholarly networks and digital libraries has resulted in an explosion of publicly available scholarly data in various forms such as authors, papers, citations, conferences, and journals. This has created interest in the domain of big scholarly data analysis, which analyzes the worldwide dissemination of scientific findings from different perspectives. Although the study of big scholarly data is relatively new, some studies have emerged on how to investigate scholarly data usage in different disciplines. These studies motivate investigating the scholarly data generated via academic technologies such as scholarly networks and digital libraries for building scalable approaches to retrieving, recommending, and analyzing scholarly content. We have analyzed these studies following a systematic methodology, classifying them into different applications based on literature features and highlighting the machine learning techniques used for this purpose. We also discuss open challenges that remain unsolved to foster future research in the field of scholarly data mining.

18 citations


Journal ArticleDOI
TL;DR: This review can act as a baseline for deep learning and machine vision experts, historical geographers, and scholars by providing them with a view of how hyperspectral imaging is implemented in multiple domains, along with future research prospects.
Abstract: Hyperspectral imaging has shown tremendous growth over the past three decades. It evolved out of remote sensing and, with ongoing technological enhancements, has expanded into various other application areas. In addition, data‐rich cubes with abundant spectral and spatial information facilitate capturing, analyzing, reviewing, and interpreting data. This review concentrates on emerging application areas of hyperspectral imaging, selected because they offer vast scope for future enhancement by exploiting cutting‐edge technology, that is, deep learning. Applications of hyperspectral imaging techniques in selected areas (remote sensing, document forgery, history and archaeology conservation, surveillance and security, machine vision for fruit quality inspection, and medical imaging) are covered. The review pivots around the publicly available datasets and the features used in each domain. It can act as a baseline for deep learning and machine vision experts, historical geographers, and scholars by providing them with a view of how hyperspectral imaging is implemented in multiple domains, along with future research prospects.

18 citations
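To make the notion of a data-rich cube concrete, here is a minimal sketch (with synthetic data and illustrative band numbers) of how a hyperspectral cube stores a full spectrum per pixel and supports simple per-pixel spectral indices.

```python
import numpy as np

# Synthetic hyperspectral cube: height x width x spectral bands.
# Real cubes (e.g., airborne remote sensing scenes) can have hundreds
# of contiguous narrow bands.
rng = np.random.default_rng(42)
cube = rng.random((100, 120, 200)).astype(np.float32)

# Each pixel carries a full spectrum: here, a 200-sample reflectance curve.
spectrum = cube[50, 60, :]          # shape (200,)

# Simple per-pixel band-ratio index (band numbers are illustrative),
# the kind of spectral feature many applications build on.
b_near, b_red = 120, 80
index = (cube[..., b_near] - cube[..., b_red]) / (
    cube[..., b_near] + cube[..., b_red] + 1e-8
)
print(spectrum.shape, index.shape)  # (200,) (100, 120)
```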



Journal ArticleDOI
TL;DR: This survey looks at various recent advances in this evolving domain in the context of digital logic testing and diagnosis.
Abstract: The insistent trend in today's nanoscale technology to keep abreast of Moore's law has been continually opening up new challenges for circuit designers. With the rapid downscaling of integration, the intricacies involved in the manufacturing process have escalated significantly. Concomitantly, the nature of defects in silicon chips has become more complex and unpredictable, adding further difficulty to circuit testing and diagnosis. The volume of test data has surged, and the parameters that govern the testing of integrated circuits have increased not only in dimension but also in the complexity of their correlation. Evidently, the current scenario serves as a pertinent platform to explore new test solutions based on machine learning. In this survey, we look at various recent advances in this evolving domain in the context of digital logic testing and diagnosis.

Journal ArticleDOI
TL;DR: The managerial implication of the study is that organizations can apply the results of the critical analysis to strengthen their strategic deployment of big data analytics in business settings, and hence to better leverage big data for sustainable organizational innovation and growth.
Abstract: In the era of “big data,” a huge number of people, devices, and sensors are connected via digital networks, and the cross‐plays among these entities generate enormous valuable data that enable organizations to innovate and grow. However, the data deluge also raises serious privacy concerns which may cause a regulatory backlash and hinder further organizational innovation. To address the challenge of information privacy, researchers have explored privacy‐preserving methodologies over the past two decades. However, a thorough study of privacy‐preserving big data analytics is missing from the existing literature. The main contributions of this article include a systematic evaluation of various privacy preservation approaches and a critical analysis of the state‐of‐the‐art privacy‐preserving big data analytics methodologies. More specifically, we propose a four‐dimensional framework for analyzing and designing the next generation of privacy‐preserving big data analytics approaches. In addition, we pinpoint the potential opportunities and challenges of applying privacy‐preserving big data analytics to business settings and provide five recommendations for doing so effectively. To the best of our knowledge, this is the first systematic study of the state‐of‐the‐art in privacy‐preserving big data analytics. The managerial implication of our study is that organizations can apply the results of our critical analysis to strengthen their strategic deployment of big data analytics in business settings, and hence to better leverage big data for sustainable organizational innovation and growth.


Journal ArticleDOI
TL;DR: This survey provides a comprehensive analysis of the research efforts so far devoted to the problem of table understanding and describes systems that support the transformation of heterogeneous tables into meaningful information.
Abstract: Table understanding methods extract, transform, and interpret the information contained in tabular data embedded in documents/files of different formats. Such automatic understanding would make it possible to exploit tabular information in order to accurately answer queries, integrate heterogeneous repositories of information into a common knowledge base, or exchange information among different sources. The purpose of this survey is to provide a comprehensive analysis of the research efforts so far devoted to the problem of table understanding and to describe systems that support the transformation of heterogeneous tables into meaningful information.
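To make the extract-transform-interpret pipeline concrete, here is a toy sketch (the HTML table and the triple schema are invented) that extracts an embedded table and flattens it into subject-predicate-object triples for knowledge-base integration; real table-understanding systems face far messier layouts.

```python
import io
import pandas as pd  # pd.read_html also requires an HTML parser such as lxml

# Toy embedded table; real systems must handle spanning cells,
# nested headers, implicit units, and non-relational layouts.
html = io.StringIO("""
<table>
  <tr><th>Country</th><th>Capital</th><th>Population (M)</th></tr>
  <tr><td>France</td><td>Paris</td><td>68</td></tr>
  <tr><td>Japan</td><td>Tokyo</td><td>124</td></tr>
</table>
""")

# Extract: parse the HTML table into a DataFrame.
df = pd.read_html(html)[0]

# Transform + interpret: flatten each row into subject-predicate-object
# triples, a common target representation for knowledge-base integration.
triples = [
    (row["Country"], col, row[col])
    for _, row in df.iterrows()
    for col in ("Capital", "Population (M)")
]
print(triples)
```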


Journal ArticleDOI
TL;DR: A time series prediction framework for urban surface temperature under cloud interference is proposed to compensate for data missing due to cloud, snow, and other interference factors, and to predict the spatial and temporal distributions of LST.
Abstract: In the prediction of urban remote sensing surface temperature, cloud, cloud shadow, and snow contamination lead to the failure of surface temperature inversion and of vegetation‐related index calculation. A time series prediction framework for urban surface temperature under cloud interference is proposed in this paper, which helps solve the problem of the impact of data loss on surface temperature prediction. Spatial and temporal variation trends of surface temperature and vegetation index are analyzed using Landsat 7/8 remote sensing data of 2010 to 2019 from Beijing. The geographically weighted regression (GWR) method is used to simulate surface temperature on the current date. A deep learning prediction network based on convolution and long short‐term memory (LSTM) networks was constructed to predict the spatial distribution of surface temperature on the next observation date. The time series analysis shows that an NDBI of less than −0.2 indicates possible cloud contamination. The land surface temperature (LST) modeling results show that the precision of estimation using the GWR method on impervious surfaces and water bodies is superior to that on vegetated areas. For LST prediction using deep learning methods, the predicted spatial distribution of surface temperature was relatively good. The purpose of this study is to compensate for data missing due to cloud, snow, and other interference factors, and to apply the framework to the prediction of the spatial and temporal distributions of LST.
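A minimal sketch of the kind of convolution + LSTM architecture the abstract describes, assuming the task is mapping a short sequence of past LST maps to the next map; the layer sizes, sequence length, and image dimensions below are invented and are not the authors' configuration.

```python
import tensorflow as tf

T, H, W = 8, 64, 64  # illustrative: 8 past LST maps of 64x64 pixels

# ConvLSTM layers learn spatio-temporal dynamics across the input
# sequence; a final convolution maps the state to the next LST image.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(T, H, W, 1)),
    tf.keras.layers.ConvLSTM2D(16, kernel_size=3, padding="same",
                               return_sequences=True),
    tf.keras.layers.ConvLSTM2D(16, kernel_size=3, padding="same"),
    tf.keras.layers.Conv2D(1, kernel_size=3, padding="same"),
])
model.compile(optimizer="adam", loss="mse")
model.summary()
# Training would fit sequences of past maps to the observed next map:
# model.fit(x, y) with x of shape (N, T, H, W, 1) and y of shape (N, H, W, 1).
```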

Journal ArticleDOI
TL;DR: To realize a low‐carbon and sustainable energy transition, smart energy systems (SES) assisted by data and information technology are regarded as promising solutions for energy system integration (ESI) and have been put into regional practice, but little attention has been paid to the development of multiregional smart energy systems (MRSES), which span three or more areas.
Abstract: To realize a low‐carbon and sustainable energy transition, smart energy systems (SES) assisted by data and information technology are regarded as promising solutions for energy system integration (ESI) and have been put into regional practice. However, little attention has been paid to the development of multiregional smart energy systems (MRSES), which span three or more areas. This article aims to analyze concepts and practices of SES and open a new perspective on MRSES. The conceptual evolution and regional practices of SES around the world are first reviewed, and it is found that SES does not mark the end of the conceptual evolution of ESI. Current regional practices are still limited to small areas, typically remote areas, urban areas, and industrial areas. Second, the review of concepts and practices of SES in China indicates that the understanding of SES concepts is still inconsistent at the national scale, and the apparent regional disparity in China calls for attention to the development of MRSES. Finally, a preliminary concept of MRSES is proposed and its prospects in China and the world are discussed; it is composed of four connected sub‐SES and described as the coordinated development of “smart energy farms + smart energy towns + smart energy industrial parks + smart energy transportation networks.” The former three sub‐SES are identified according to the varying economic characteristics and resource endowments of different regions, and they are all connected by the fourth sub‐SES. Although this concept is still preliminary, it provides a vision of future large‐scale SES, and its realization requires further breakthroughs in data technology.

Journal ArticleDOI
TL;DR: This systematic literature review aims to assess the use of community detection techniques in analyzing network structure in online learning environments, and highlights the need to include automated community discovery techniques in online learning environments to facilitate and enhance their use.
Abstract: Uncovering community structure has brought significant advances in explaining, analyzing, and forecasting the behaviors and dynamics of networks in fields such as sociology, criminology, biology, medicine, communication, economics, and academia. Detecting and clustering communities is a powerful step toward identifying the structural properties and behavioral patterns in social networks. Recently, online learning has been progressively adopted by many educational practices, which raises many questions about assessing learners' engagement, collaboration, and behaviors in the newly emerging learning communities. This systematic literature review aims to assess the use of community detection techniques in analyzing network structure in online learning environments. It provides a comprehensive overview of the existing research that adopted those techniques, identifies the educational objectives behind their application, and suggests possible future research directions. Our analysis covered 65 studies found in the literature that applied different community discovery techniques to various types of online learning environments to analyze their users' interaction patterns. Our review revealed the potential of this field to improve educational practices and decisions and to utilize the massive amount of data generated from interacting with those environments. Finally, we highlighted the need to include automated community discovery techniques in online learning environments to facilitate and enhance their use, and stressed the need for further advanced research to uncover the many hidden opportunities.
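To make the community detection step concrete, here is a minimal sketch using modularity-based detection on a stand-in social graph; the learner-network interpretation is illustrative, and the reviewed studies apply a variety of technique families beyond this one.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy interaction network: think of nodes as learners and edges as forum
# replies. Zachary's karate club is a standard stand-in for such a graph.
G = nx.karate_club_graph()

# Modularity-based community detection: greedily merge groups to
# maximize modularity, yielding a partition of the node set.
communities = greedy_modularity_communities(G)

for i, community in enumerate(communities):
    print(f"community {i}: {sorted(community)}")
```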


Journal ArticleDOI
TL;DR: This article is the first study to analyze the performance of well‐known classification algorithms over differentially private data and to discover which datasets are more suitable for privacy‐preserving classification when input perturbation is applied to provide data privacy.
Abstract: Privacy‐preserving data classification is an important research area in the data mining field. The goal of a privacy‐preserving classification algorithm is to protect sensitive information as much as possible, while providing satisfactory classification accuracy. Differential privacy is a strong privacy guarantee that enables privacy of sensitive data stored in a database by bounding the leakage of sensitive information with respect to an ɛ parameter. In this study, our aim is to investigate the classification performance of state‐of‐the‐art classification algorithms such as C4.5, Naïve Bayes, One Rule, Bayesian Networks, PART, Ripper, K*, IBk, and Random Tree for performing privacy‐preserving classification. To preserve the privacy of the data to be classified, we applied the input perturbation technique from differential privacy and observed the relationship between the ɛ parameter values and the accuracy of the classifiers. To the best of our knowledge, this article is the first study that analyzes the performance of well‐known classification algorithms over differentially private data and discovers which datasets are more suitable for privacy‐preserving classification when input perturbation is applied to provide data privacy. The classification algorithms are compared using differentially private versions of well‐known datasets from the UCI repository. According to the experimental results, as the ɛ parameter value increases, better classification accuracies are achieved at lower privacy levels. When the classifiers are compared, the Naïve Bayes classifier is the most successful method. The ɛ parameter should be greater than or equal to 2 (i.e., ɛ ≥ 2) to achieve satisfactory classification accuracies.
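A minimal sketch of input perturbation via the Laplace mechanism, the standard way of adding ɛ-calibrated noise to numeric attributes; the data, sensitivity, and ɛ values are illustrative, and the paper's exact perturbation setup may differ. Note how larger ɛ (weaker privacy) yields less distortion, consistent with the reported accuracy trend.

```python
import numpy as np

def laplace_perturb(x: np.ndarray, sensitivity: float, epsilon: float) -> np.ndarray:
    """Input perturbation with the Laplace mechanism: add Laplace noise
    with scale sensitivity/epsilon to each numeric attribute value."""
    scale = sensitivity / epsilon
    return x + np.random.laplace(loc=0.0, scale=scale, size=x.shape)

rng = np.random.default_rng(1)
features = rng.random((5, 3))        # toy records with attributes in [0, 1]
sensitivity = 1.0                    # range of each attribute

for epsilon in (0.5, 2.0, 8.0):
    noisy = laplace_perturb(features, sensitivity, epsilon)
    err = np.abs(noisy - features).mean()
    print(f"epsilon={epsilon:>4}: mean absolute distortion {err:.3f}")
```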

Journal ArticleDOI
TL;DR: An overview is provided of the role of imaging in oncology, the different techniques that are shaping the way DL algorithms are being made ready for clinical use, and the problems that DL techniques still need to address before DL can find a home in clinics.
Abstract: Deep learning (DL)‐based interpretation of medical images has reached a critical juncture of expanding outside research projects into translational ones, and is ready to make its way to the clinics. Advances over the last decade in data availability, DL techniques, and computing capabilities have accelerated this journey. Through this journey, today we have a better understanding of the challenges to and pitfalls of wider adoption of DL into clinical care, which, in our view, should and will drive the advances in this field in the next few years. The most important among these challenges are the lack of an appropriately digitized environment within healthcare institutions, the lack of adequate open and representative datasets on which DL algorithms can be trained and tested, and the lack of robustness of widely used DL training algorithms to certain pervasive pathological characteristics of medical images and repositories. In this review, we provide an overview of the role of imaging in oncology, the different techniques that are shaping the way DL algorithms are being made ready for clinical use, and the problems that DL techniques still need to address before DL can find a home in clinics. Finally, we provide a summary of how DL can potentially drive the adoption of digital pathology, vendor‐neutral archives, and picture archival and communication systems. We caution that researchers may find the coverage of their own fields to be high‐level. This is by design, as the format is meant to introduce those looking in from outside deep learning and medical research, respectively, to the main concerns and limitations of these two fields, rather than to tell them something new about their own.

Journal ArticleDOI
TL;DR: The challenge of heterogeneous multivariate temporal data analysis is discussed, along with various options for dealing with it, focusing on the increasingly used option of transforming the data into symbolic time intervals through temporal abstraction and using time‐interval‐related pattern discovery.
Abstract: With the information technology revolution, and especially the adoption of the Internet of Things, longitudinal data in many domains have become more available and accessible for secondary analysis. Such data provide meaningful opportunities to understand processes in many domains over time, but also pose challenges. A main challenge is the heterogeneity of the temporal variables, due to the different types of data, whether a measurement or an event, and the types of sampling: fixed or irregular. Other variables can be events that may or may not have a duration. In this review, we discuss the various types of temporal data and the relevant analysis methods: starting with fixed‐frequency variables and forecasting and time series methods, and proceeding to sequential data, sequential pattern mining, and time‐interval mining for events of varying duration. The use of various deep learning‐based architectures for temporal data is also discussed. We then address the challenge of heterogeneous multivariate temporal data analysis and discuss various options for dealing with it, focusing on the increasingly used option of transforming the data into symbolic time intervals through temporal abstraction and using time‐interval‐related pattern discovery for temporal knowledge discovery, clustering, classification, prediction, and more. Finally, we give an overview of the field and the areas in which more studies and contributions are needed.
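To make temporal abstraction concrete, here is a minimal sketch of state abstraction: a sampled numeric series is mapped to symbolic time intervals by thresholding values and merging consecutive samples that share a symbol. The thresholds and the series are invented for illustration.

```python
def temporal_abstraction(samples, thresholds=(36.5, 38.0)):
    """State abstraction: map a timestamped numeric series to symbolic
    time intervals (label, start, end). Thresholds are illustrative."""
    low, high = thresholds

    def label(v):
        return "LOW" if v < low else "HIGH" if v > high else "NORMAL"

    intervals = []
    for t, v in samples:
        sym = label(v)
        if intervals and intervals[-1][0] == sym:
            intervals[-1][2] = t            # extend the current interval
        else:
            intervals.append([sym, t, t])   # open a new interval
    return [tuple(iv) for iv in intervals]

# Toy body-temperature series: (hour, value) pairs.
series = [(0, 36.2), (1, 36.4), (2, 37.1), (3, 38.6), (4, 38.9), (5, 37.0)]
print(temporal_abstraction(series))
# [('LOW', 0, 1), ('NORMAL', 2, 2), ('HIGH', 3, 4), ('NORMAL', 5, 5)]
```

Pattern discovery methods then mine relations (e.g., Allen's "before" or "overlaps") among such symbolic intervals across variables.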

Journal ArticleDOI
TL;DR: This article provides a comprehensive review of both non‐spotting and spotting‐based mining techniques, identifies the limitations of existing methods, and suggests new applications and future directions for continuing the research.
Abstract: In computer terminology, mining is considered as extracting meaningful information or knowledge from a large amount of data/information using computers. The meaningful information can be extracted from normal text and from images obtained from different resources, such as natural scene images, video, and documents, by deriving semantics from the text and the content of the images. Although there are many pieces of work on text/data mining and several survey/review papers published in the literature, to the best of our knowledge there is no survey paper on mining textual information from natural scene, video, and document images that considers word spotting techniques. In this article, we therefore provide a comprehensive review of both non‐spotting and spotting‐based mining techniques. The mining approaches are categorized as feature‐based, learning‐based, and hybrid methods to analyze the strengths and limitations of the models in each category. The article also discusses the usefulness of the methods in different situations and applications. Furthermore, based on the review of different mining approaches, it identifies the limitations of existing methods and suggests new applications and future directions for continuing the research. We believe such a review will help researchers quickly become familiar with the state of the art and the progress made toward mining textual information from natural scene and video images.


Journal ArticleDOI
TL;DR: It is argued that the quality of data mining results is directly related to the extent to which they reflect important properties of the real‐world entities represented therein, and two particular types of artifacts produced by this area are briefly elaborated on.
Abstract: For many years, the role played by domain knowledge in all stages of knowledge discovery has been recognized. However, the real‐world semantics embedded in data are often still not fully considered in traditional data mining methods. In this article, we argue that the quality of data mining results is directly related to the extent to which they reflect important properties of the real‐world entities represented therein. Analyzing and characterizing the nature of these entities is the very business of the area of formal ontology. We briefly elaborate on two particular types of artifacts produced by this area: foundational ontologies and ontology‐driven conceptual modeling languages grounded on them. We then elaborate on the benefits they can bring to several activities in a data mining process.


Journal ArticleDOI
TL;DR: Existing approaches for behavioral fingerprinting of devices in general are discussed, their applicability to IoT devices is evaluated, and future research directions for fingerprinting in the IoT domain are highlighted.
Abstract: Rapid advances in the Internet‐of‐Things (IoT) domain have led to the development of several useful and interesting devices that have enhanced the quality of home living and industrial automation. Vulnerabilities in IoT devices have rendered them susceptible to compromise and forgery. The problem of device authentication, that is, the question of whether a device's identity is what it claims to be, is still open. Device fingerprinting seems to be a promising authentication mechanism: it profiles a device based on the information available about it and generates a robust, verifiable, and unique identity for the device. Existing approaches to device fingerprinting may not be feasible or cost‐effective for the IoT domain due to the resource constraints and heterogeneity of IoT devices. Given these resource and cost constraints, behavioral fingerprinting provides a promising direction for fingerprinting IoT devices, as it allows security researchers to understand the behavioral profile of a device and to establish guidelines regarding its operations. In this article, we discuss existing approaches for behavioral fingerprinting of devices in general and evaluate their applicability to IoT devices. Furthermore, we discuss potential approaches for fingerprinting IoT devices and give an overview of some preliminary attempts to do so. We conclude by highlighting future research directions for fingerprinting in the IoT domain.
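A minimal sketch of the behavioral fingerprinting idea: summarize traffic windows as feature vectors and train a classifier that maps observed behavior back to a device identity. The feature choices, device profiles, and data below are synthetic and purely illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical behavioral features per traffic window: mean packet size,
# packet-size spread, mean inter-arrival time, distinct destination ports.
rng = np.random.default_rng(7)

def windows(mean_size, iat, n=50):
    return np.column_stack([
        rng.normal(mean_size, 15, n),   # mean packet size (bytes)
        rng.normal(20, 5, n),           # packet-size standard deviation
        rng.normal(iat, 0.1, n),        # mean inter-arrival time (s)
        rng.integers(1, 5, n),          # distinct destination ports
    ])

X = np.vstack([windows(120, 0.5), windows(900, 2.0)])  # two device profiles
y = np.array([0] * 50 + [1] * 50)                      # device labels

# A classifier over such features acts as the behavioral fingerprint:
# it maps an observed traffic window back to a device identity.
clf = RandomForestClassifier(random_state=0).fit(X, y)
print(clf.predict(windows(905, 2.1, n=3)))  # expect device 1
```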

Journal ArticleDOI
TL;DR: This review could help investigators understand the principles of existing methods, and thus develop new methods to advance the computational prediction of disease genes.
Abstract: Complex diseases are associated with a set of genes (called disease genes), the identification of which can help scientists uncover the mechanisms of diseases and develop new drugs and treatment strategies. Due to the huge cost and time of experimental identification techniques, many computational algorithms have been proposed to predict disease genes. Although several review publications in recent years have discussed many computational methods, some focus on cancer driver genes while others focus on biomolecular networks, each covering only a specific aspect of existing methods. In this review, we summarize existing methods and classify them into three categories based on their rationales. Then, the algorithms, biological data, and evaluation methods used in the computational prediction are discussed. Finally, we highlight the limitations of existing methods and point out some future directions for improving these algorithms. This review could help investigators understand the principles of existing methods, and thus develop new methods to advance the computational prediction of disease genes.