
Showing papers in "Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery in 2022"


Journal ArticleDOI
TL;DR: This article aims at covering several methodologies that have been developed for causal discovery and causal inference, and provides a practical toolkit for interested researchers and practitioners.
Abstract: Causality is a complex concept, which roots its developments across several fields, such as statistics, economics, epidemiology, computer science, and philosophy. In recent years, the study of causal relationships has become a crucial part of the Artificial Intelligence community, as causality can be a key tool for overcoming some limitations of correlation-based Machine Learning systems. Causality research can generally be divided into two main branches, that is, causal discovery and causal inference. The former focuses on obtaining causal knowledge directly from observational data. The latter aims to estimate the impact deriving from a change of a certain variable over an outcome of interest. This article aims at covering several methodologies that have been developed for both tasks. This survey does not only focus on theoretical aspects but also provides a practical toolkit for interested researchers and practitioners, including software, datasets, and running examples.
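
As a concrete taste of the causal-inference branch, the following minimal numpy sketch (an editorial illustration, not taken from the paper's toolkit) estimates a treatment effect on synthetic data via backdoor adjustment and contrasts it with the confounded naive estimate:

```python
# Minimal sketch of covariate (backdoor) adjustment for causal effect
# estimation on synthetic data -- illustrative only, not the paper's toolkit.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
z = rng.normal(size=n)                            # confounder
t = (z + rng.normal(size=n) > 0).astype(float)    # treatment depends on z
y = 2.0 * t + 1.5 * z + rng.normal(size=n)        # true effect of t on y is 2.0

naive = y[t == 1].mean() - y[t == 0].mean()       # biased by the confounder

# Adjust for z: regress y on [1, t, z]; the coefficient of t estimates the ATE.
X = np.column_stack([np.ones(n), t, z])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

print(f"naive difference: {naive:.2f}")   # noticeably above 2.0
print(f"adjusted ATE:     {beta[1]:.2f}") # close to the true 2.0
```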

29 citations


Journal ArticleDOI
Tai Le Quy
TL;DR: In this paper, the authors focus on tabular data as the most common data representation for fairness-aware ML and identify relationships between the different attributes, particularly with respect to protected attributes and the class attribute, using a Bayesian network.
Abstract: As decision-making increasingly relies on machine learning (ML) and (big) data, the issue of fairness in data-driven artificial intelligence systems is receiving increasing attention from both research and industry. A large variety of fairness-aware ML solutions have been proposed which involve fairness-related interventions in the data, learning algorithms, and/or model outputs. However, a vital part of proposing new approaches is evaluating them empirically on benchmark datasets that represent realistic and diverse settings. Therefore, in this paper, we overview real-world datasets used for fairness-aware ML. We focus on tabular data as the most common data representation for fairness-aware ML. We start our analysis by identifying relationships between the different attributes, particularly with respect to protected attributes and the class attribute, using a Bayesian network. For a deeper understanding of bias in the datasets, we investigate interesting relationships using exploratory analysis. This article is categorized under: Commercial, Legal, and Ethical Issues > Fairness in Data Mining; Fundamental Concepts of Data and Knowledge > Data Concepts; Technologies > Data Preprocessing
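
To make the fairness angle concrete, here is a small illustrative sketch (the metric choice is an assumption, not taken from the paper) computing the statistical parity difference between a protected group and the rest on toy predictions:

```python
# Illustrative check of one common group-fairness measure (statistical
# parity difference) on a toy tabular dataset.
import numpy as np

rng = np.random.default_rng(1)
n = 1_000
protected = rng.integers(0, 2, size=n)          # e.g., 0/1 encoded group
# Toy predictions that are slightly skewed against the protected group:
p_pos = np.where(protected == 1, 0.35, 0.50)
y_pred = (rng.random(n) < p_pos).astype(int)

rate_g1 = y_pred[protected == 1].mean()
rate_g0 = y_pred[protected == 0].mean()
spd = rate_g1 - rate_g0                         # 0 would be parity
print(f"positive rate (protected): {rate_g1:.2f}")
print(f"positive rate (other):     {rate_g0:.2f}")
print(f"statistical parity difference: {spd:+.2f}")
```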

19 citations


Journal ArticleDOI
TL;DR: It is concluded that future research has to holistically consider the automation of the forecasting pipeline to enable the large-scale application of time series forecasting.
Abstract: Time series forecasting is fundamental for various use cases in different domains such as energy systems and economics. Creating a forecasting model for a specific use case requires an iterative and complex design process. The typical design process includes five sections: (1) data preprocessing, (2) feature engineering, (3) hyperparameter optimization, (4) forecasting method selection, and (5) forecast ensembling, which are commonly organized in a pipeline structure. One promising approach to handle the ever-growing demand for time series forecasts is automating this design process. The article, thus, reviews existing literature on automated time series forecasting pipelines and analyzes how the design process of forecasting models is currently automated. Thereby, we consider both automated machine learning (AutoML) and automated statistical forecasting methods in a single forecasting pipeline. For this purpose, we first present and compare the identified automation methods for each pipeline section. Second, we analyze these automation methods regarding their interaction, combination, and coverage of the five pipeline sections. For both, we discuss the reviewed literature that contributes toward automating the design process, identify problems, give recommendations, and suggest future research. This review reveals that the majority of the reviewed literature only covers two or three of the five pipeline sections. We conclude that future research has to holistically consider the automation of the forecasting pipeline to enable the large-scale application of time series forecasting.
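
A toy sketch of the five pipeline sections on a synthetic series may help fix ideas; all modeling choices below (lag features, ridge weights, the small grid) are illustrative assumptions, not the automation methods the review covers:

```python
# Toy end-to-end sketch of the five pipeline sections on a synthetic
# series: (1) preprocessing, (2) lag features, (3) hyperparameter search,
# (4) method selection, (5) ensembling. Pure numpy; all choices illustrative.
import numpy as np

rng = np.random.default_rng(2)
t = np.arange(400)
y = np.sin(2 * np.pi * t / 24) + 0.1 * rng.normal(size=t.size)  # hourly-like series

y = (y - y.mean()) / y.std()                     # (1) simple preprocessing

def lag_matrix(series, n_lags):                  # (2) feature engineering
    X = np.column_stack([series[i:-(n_lags - i) or None] for i in range(n_lags)])
    return X[:, ::-1], series[n_lags:]           # most recent lag first

def fit_ridge(X, y, alpha):
    XtX = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(XtX, X.T @ y)

split = 300
candidates = []
for n_lags in (2, 6, 24):                        # (3) hyperparameter search
    for alpha in (0.1, 1.0, 10.0):
        X, target = lag_matrix(y, n_lags)
        Xtr, ytr = X[: split - n_lags], target[: split - n_lags]
        Xva, yva = X[split - n_lags :], target[split - n_lags :]
        w = fit_ridge(Xtr, ytr, alpha)
        mse = np.mean((Xva @ w - yva) ** 2)
        candidates.append((mse, n_lags, alpha, w))

candidates.sort(key=lambda c: c[0])              # (4) method selection
best_two = candidates[:2]
print("selected configs (lags, alpha):", [(c[1], c[2]) for c in best_two])

# (5) ensembling: average the one-step forecasts of the two best models.
forecasts = [y[-n_lags:][::-1] @ w for _, n_lags, _, w in best_two]
print(f"ensemble one-step forecast: {np.mean(forecasts):.3f}")
```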

14 citations


Journal ArticleDOI
TL;DR: This survey shows the current state of the art by reviewing the main publications, the main types of fused educational data, and the data fusion approaches and techniques used in EDM/LA, as well as the main open problems, trends, and challenges in this specific research area.
Abstract: New educational models such as smart learning environments use digital and context-aware devices to facilitate the learning process. In this new educational scenario, a huge quantity of multimodal students' data from a variety of different sources can be captured, fused, and analyzed. This offers researchers and educators a unique opportunity to discover new knowledge to better understand the learning process and to intervene if necessary. However, it is necessary to apply data fusion approaches and techniques correctly in order to combine various sources of multimodal learning analytics (MLA). These sources or modalities in MLA include audio, video, electrodermal activity data, eye-tracking, user logs, and click-stream data, but also learning artifacts and more natural human signals such as gestures, gaze, speech, or writing. This survey introduces data fusion in learning analytics (LA) and educational data mining (EDM) and how these data fusion techniques have been applied in smart learning. It shows the current state of the art by reviewing the main publications, the main types of fused educational data, and the data fusion approaches and techniques used in EDM/LA, as well as the main open problems, trends, and challenges in this specific research area.
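
As a minimal illustration of one fusion technique (feature-level "early" fusion; the shapes and modality choices are assumptions for illustration), consider:

```python
# Minimal sketch of feature-level ("early") fusion of two modalities:
# z-score each modality's features, then concatenate per student.
import numpy as np

rng = np.random.default_rng(3)
n_students = 50
eda = rng.normal(4.0, 1.0, size=(n_students, 8))               # electrodermal features
clicks = rng.poisson(20, size=(n_students, 12)).astype(float)  # click-stream features

def zscore(m):
    return (m - m.mean(axis=0)) / m.std(axis=0)

fused = np.hstack([zscore(eda), zscore(clicks)])   # one row per student
print(fused.shape)   # (50, 20): ready for a downstream EDM/LA model
```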

12 citations


Journal ArticleDOI
TL;DR: A systematic literature review that provides an overview of the current state of research concerning predictive maintenance from a data mining perspective and presents a first taxonomy reflecting the phases of any data mining process for solving a predictive maintenance problem.
Abstract: Predictive maintenance is a field of study whose main objective is to optimize the timing and type of maintenance to perform on various industrial systems. This aim involves maximizing the availability time of the monitored system and minimizing the number of resources used in maintenance. Predictive maintenance is currently undergoing a revolution thanks to advances in industrial systems monitoring within the Industry 4.0 paradigm. Likewise, advances in artificial intelligence and data mining allow the processing of a great amount of data to provide more accurate and advanced predictive models. In this context, many actors have become interested in predictive maintenance research, making it one of the most active areas of research in computing, where academia and industry converge. The objective of this paper is to conduct a systematic literature review that provides an overview of the current state of research concerning predictive maintenance from a data mining perspective. The review presents a first taxonomy that reflects the different phases of any data mining process for solving a predictive maintenance problem, relating the predictive maintenance tasks with the main data mining tasks that solve them. Finally, the paper presents significant challenges and future research directions in terms of the potential of data mining applied to predictive maintenance.
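
The following toy sketch (editorial, with assumed thresholds and prediction horizon) shows how such a problem is commonly cast as a data mining task: window the sensor data, extract features, and label windows that precede a failure:

```python
# Sketch of framing predictive maintenance as a classification task:
# slide a window over sensor readings, extract simple features, and
# label windows close to a (known, historical) failure.
import numpy as np

rng = np.random.default_rng(4)
vibration = np.abs(rng.normal(1.0, 0.2, size=1_000))
vibration[900:] *= np.linspace(1, 3, 100)        # degradation before failure
failure_time = 999

win, horizon = 50, 100
X, y = [], []
for start in range(0, len(vibration) - win, win):
    w = vibration[start : start + win]
    X.append([w.mean(), w.std(), w.max()])       # simple window features
    y.append(int(failure_time - (start + win) <= horizon))  # "failure soon?"

X, y = np.array(X), np.array(y)
print(f"{len(y)} windows, {y.sum()} labeled as pre-failure")
```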

10 citations


Journal ArticleDOI
TL;DR: The application of artificial intelligence (AI) based methods/algorithms to predict the bus arrival time (BAT) is reviewed in detail, and a thorough discussion is presented to elaborate on the different branches of AI that have been applied to several aspects of BAT prediction.
Abstract: Buses are one of the most important parts of the public transport system. Providing accurate information about bus arrival and departure times at bus stops is one of the main parameters of good-quality public transport. Accurate arrival- and departure-time information is important for a public transport mode, since it enhances ridership as well as traveler satisfaction. With accurate arrival- and departure-time information, travelers can make informed decisions about their journey. The application of artificial intelligence (AI) based methods/algorithms to predict the bus arrival time (BAT) is reviewed in detail. A systematic survey of existing research applying the different branches of AI has been conducted. Prediction models are segregated and accumulated under their respective branches of AI. A thorough discussion is presented to elaborate on the different branches of AI that have been applied to several aspects of BAT prediction. Research gaps and possible directions for further research work are summarized.

8 citations


Journal ArticleDOI
TL;DR: In this review paper, an extensive survey of research works is presented, seeking to solve Arabic word sense disambiguation with the existing AWSD datasets.
Abstract: In communication, textual data are a vital attribute. In all languages, the meaning of ambiguous or polysemous words changes depending on the context in which they are used. Determining an ambiguous word's correct meaning is a challenging task in natural language processing (NLP). Word sense disambiguation (WSD) is an NLP process to analyze and determine the correct meaning of polysemous words in a text. WSD is a computational linguistics task that automatically identifies a polysemous word's set of senses. Based on the context in which a word appears, WSD recognizes and tags the word with its correct, a priori known meaning. Semitic languages like Arabic pose even greater challenges than other languages, since Arabic lacks diacritics and standardization and suffers from a massive shortage of available resources. Recently, many approaches and techniques have been suggested to solve the word ambiguity dilemma in many different ways and for several languages. In this review paper, an extensive survey of research works is presented, seeking to solve Arabic word sense disambiguation with the existing AWSD datasets.
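
For readers new to WSD, a minimal sketch of the classic Lesk gloss-overlap idea (with toy English glosses standing in for a real AWSD resource) looks like this:

```python
# Simplified Lesk algorithm: pick the sense whose gloss overlaps most with
# the context words. Glosses below are toy English stand-ins; a real Arabic
# WSD system would use an AWSD resource instead.
senses = {
    "bank/finance": "institution that accepts deposits and lends money",
    "bank/river":   "sloping land beside a body of water such as a river",
}

def lesk(context: str, senses: dict) -> str:
    ctx = set(context.lower().split())
    overlap = {s: len(ctx & set(g.lower().split())) for s, g in senses.items()}
    return max(overlap, key=overlap.get)

print(lesk("he sat on the bank of the river fishing", senses))  # bank/river
```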

8 citations


Journal ArticleDOI
TL;DR: An overview of the FSM problem, its important phases, and the main groups of FSM algorithms is presented, together with a survey of many modern applied algorithms.
Abstract: Large graphs are often used to simulate and model complex systems in various research and application fields. Frequent subgraph mining (FSM) in single large graphs is therefore a vital issue; recently, it has attracted numerous researchers and has played an important role in various tasks for both research and application purposes. FSM is aimed at finding all subgraphs whose number of appearances in a large graph is greater than or equal to a given frequency threshold. In most recent applications, the underlying graphs are very large, such as social networks, and therefore algorithms for FSM from a single large graph have been rapidly developed, but all of them have NP-hard (nondeterministic polynomial time) complexity with huge search spaces, and therefore still need a lot of time and memory to store and process. In this article, we present an overview of the problems of FSM, the important phases in FSM, and the main groups of FSM algorithms, as well as surveying many modern applied algorithms. This includes many practical applications and is a fundamental premise for many studies in the future.
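
As a toy illustration of the base case such miners start from (single-edge patterns; real algorithms grow these seeds into larger subgraphs and use support measures such as MNI), consider:

```python
# Sketch of the base case of single-graph FSM: count labeled single-edge
# patterns and keep those meeting the frequency threshold. Illustrative only.
from collections import Counter

labels = {0: "A", 1: "B", 2: "A", 3: "C", 4: "B"}          # node labels
edges = [(0, 1), (2, 1), (2, 4), (3, 4), (0, 4)]           # undirected edges

support = Counter()
for u, v in edges:
    pattern = tuple(sorted((labels[u], labels[v])))        # e.g., ("A", "B")
    support[pattern] += 1

threshold = 2
frequent = {p: c for p, c in support.items() if c >= threshold}
print(frequent)   # {('A', 'B'): 4}
```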

7 citations


Journal ArticleDOI
TL;DR: A systematic discussion of seven modern learning paradigms and their connection to the traditional ones, including multi-label learning (MLL), semi-supervised learning (SSL), one-class classification (OCC), positive-unlabeled learning (PUL), transfer learning (TL), multi-task learning (MTL), and one-shot learning (OSL).
Abstract: Machine learning is a field composed of various pillars. Traditionally, supervised learning (SL), unsupervised learning (UL), and reinforcement learning (RL) are the dominating learning paradigms that have inspired the field since the 1950s. Based on these, thousands of different methods have been developed during the last seven decades and are used in nearly all application domains. However, recently, other learning paradigms have been gaining momentum which complement and extend the above learning paradigms significantly. These are multi-label learning (MLL), semi-supervised learning (SSL), one-class classification (OCC), positive-unlabeled learning (PUL), transfer learning (TL), multi-task learning (MTL), and one-shot learning (OSL). The purpose of this article is a systematic discussion of these modern learning paradigms and their connection to the traditional ones. We discuss each of the learning paradigms formally by defining key constituents and paying particular attention to the data requirements for allowing an easy connection to applications. That means we assume a data-driven perspective. This perspective will also allow a systematic identification of relations between the individual learning paradigms in the form of a learning-paradigm graph (LP-graph). Overall, the LP-graph establishes a taxonomy among 10 different learning paradigms.
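
One illustrative way to encode such an LP-graph is an adjacency dict; the edges below are plausible placeholder links chosen purely for illustration and may well differ from the paper's actual graph:

```python
# Illustrative encoding of a learning-paradigm graph as an adjacency dict.
# Edges are hypothetical "relates to / specializes" links, not the paper's.
lp_graph = {
    "SL":  ["MLL", "MTL", "TL"],    # supervised learning and extensions
    "UL":  ["SSL"],                 # unlabeled data feeds semi-supervision
    "SSL": ["PUL"],                 # PU learning: labeled positives only
    "OCC": ["PUL"],                 # one-class view of a related setting
    "TL":  ["OSL"],                 # one-shot learning leans on transfer
    "RL":  [],
}

def reachable(graph, start):
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen

print(sorted(reachable(lp_graph, "SL")))  # paradigms connected from SL
```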

4 citations


Journal ArticleDOI
Abstract: This article emphasizes comprehending the "Garbage In, Garbage Out" (GIGO) rationale and ensuring dataset quality in Machine Learning (ML) applications to achieve high and generalizable performance. An initial step should be added to an ML workflow in which researchers evaluate the insights gained by quantitative analysis of the dataset's sample and feature spaces. This study contributes toward achieving such a goal by suggesting a technique to quantify datasets in terms of feature frequency distribution characteristics, providing a unique insight into how frequent the features in the available dataset samples are. The technique was demonstrated on 11 benign and malign (malware) Android application datasets belonging to six academic Android mobile malware classification studies. The permissions requested by applications, such as CALL_PHONE, compose a relatively high-dimensional binary feature space. The results showed that the distributions fit well into two of the four long right-tail statistical distributions considered: log-normal, exponential, power law, and Poisson. Precisely, log-normal was the most exhibited statistical distribution, except for the two malign datasets, which fit the exponential distribution. This study also explores statistical distribution fit/unfit feature analysis, which enhances the insights into the feature space. Finally, the study compiles examples of phenomena in the literature exhibiting these statistical distributions that should be considered when interpreting the fitted distributions. In conclusion, conducting well-formed statistical methods provides a clear understanding of the datasets and of intra-class and inter-class differences before proceeding with selecting features and building a classifier model. Feature distribution characteristics should be one of the properties analyzed beforehand.
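
A sketch of this kind of distribution-fit check, on simulated frequencies rather than the Android datasets, could look as follows:

```python
# Fit log-normal and exponential distributions to synthetic feature
# frequencies and compare Kolmogorov-Smirnov statistics. Data simulated,
# not from the paper's datasets.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
freqs = rng.lognormal(mean=2.0, sigma=1.0, size=500)   # feature frequencies

ln_params = stats.lognorm.fit(freqs, floc=0)
ex_params = stats.expon.fit(freqs, floc=0)

ks_ln = stats.kstest(freqs, "lognorm", args=ln_params).statistic
ks_ex = stats.kstest(freqs, "expon", args=ex_params).statistic
print(f"KS log-normal:  {ks_ln:.3f}")   # smaller = better fit
print(f"KS exponential: {ks_ex:.3f}")
```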

4 citations


Journal ArticleDOI
TL;DR: In this paper, the authors propose the application of digital twins to the healthcare domain to provide enhanced clinical decision support and enable more patient-centric, and simultaneously more precise and individualized, care.
Abstract: Digital twins, succinctly described as the digital representation of a physical object, is a concept that has emerged relatively recently with increasing application in the manufacturing industry. This article proposes the application of this concept to the healthcare domain to provide enhanced clinical decision support and enable more patient-centric, and simultaneously more precise and individualized, care. Digital twins combined with advances in Artificial Intelligence (AI) have the potential to facilitate the integration and processing of vast amounts of heterogeneous data stemming from diversified sources. Hence, in healthcare this can provide enhanced diagnosis and treatment decision support. In applying digital twins in combination with AI to complex healthcare contexts to assist clinical decision making, it is also likely that a key current challenge in healthcare can be met: providing better-quality care that is of high value and can lead to better clinical outcomes and a higher level of patient satisfaction. In this focus article, we address this proposition by focusing on the case study of cancer care and present our conceptualization of a digital twin model combined with AI to address key, current limitations in endometrial cancer treatment. We highlight the role of AI techniques in developing digital twins for cancer care and simultaneously identify key barriers and facilitators of this process from both a healthcare and technology perspective.

Journal ArticleDOI
TL;DR: In this paper, generalized additive models for location, scale, and shape (GAMLSS) are used to model all the parameters of the distribution of the response variable with respect to the explanatory variables.
Abstract: The advent of technological developments is allowing to gather large amounts of data in several research fields. Learning analytics (LA)/educational data mining has access to big observational unstructured data captured from educational settings and relies mostly on unsupervised machine learning (ML) algorithms to make sense of this type of data. Generalized additive models for location, scale, and shape (GAMLSS) are a supervised statistical learning framework that allows modeling all the parameters of the distribution of the response variable with respect to the explanatory variables. This article overviews the power and flexibility of GAMLSS in relation to some ML techniques. Also, GAMLSS's capability to be tailored toward causality via causal regularization is briefly discussed. This overview is illustrated via a data set from the field of LA.
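
The core GAMLSS idea for a Gaussian response can be sketched in a few lines: model both the mean and the (log) standard deviation as functions of a covariate and fit by maximum likelihood. This is only a minimal stand-in for the full framework, which supports many distributions and smooth terms:

```python
# Minimal GAMLSS-flavored fit: linear models for BOTH the location (mu)
# and the log-scale (sigma) of a Gaussian response, by maximum likelihood.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
x = rng.uniform(0, 1, size=400)
y = 1.0 + 2.0 * x + rng.normal(scale=np.exp(-1.0 + 1.5 * x))  # heteroscedastic

def neg_log_lik(theta):
    b0, b1, g0, g1 = theta
    mu = b0 + b1 * x                 # location model
    sigma = np.exp(g0 + g1 * x)      # scale model (log link keeps sigma > 0)
    return np.sum(0.5 * ((y - mu) / sigma) ** 2 + np.log(sigma))

fit = minimize(neg_log_lik, x0=np.zeros(4), method="BFGS")
print("mu:        intercept %.2f slope %.2f" % tuple(fit.x[:2]))
print("log sigma: intercept %.2f slope %.2f" % tuple(fit.x[2:]))
```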

Journal ArticleDOI
TL;DR: This work focuses on supervised learning, transfer learning, reinforcement learning, and multimodal learning to illustrate how innovative AI methods can enable better-informed choices, tailor adaptation measures to heterogeneous groups, and generate effective synergies and trade-offs.
Abstract: Although artificial intelligence (AI; inclusive of machine learning) is gaining traction in supporting climate change projections and impacts, limited work has used AI to address climate change adaptation. We identify this gap and highlight the value of AI, especially in supporting complex adaptation choices and implementation. We illustrate how AI can effectively leverage precise, real-time information in data-scarce settings. We focus on supervised learning, transfer learning, reinforcement learning, and multimodal learning to illustrate how innovative AI methods can enable better-informed choices, tailor adaptation measures to heterogeneous groups, and generate effective synergies and trade-offs.

Journal ArticleDOI
TL;DR: This article reviews and analyzes different MI methods applied to investigate these problems, finding that neural network model-based methods are more general and that their solutions are continuous over the given domain of integration, self-adaptive, and usable as a black box.
Abstract: This article is dedicated to studying the impact of machine intelligence (MI) methods, viz. various types of neural models, for investigating dynamical systems arising in interdisciplinary areas. Different types of artificial neural network (ANN) methods, viz., recurrent neural network, functional-link neural network, convolutional neural network, symplectic artificial neural network, genetic algorithm neural network, and so on, have been addressed by different researchers to investigate these problems. Although various traditional methods have been developed to solve these dynamical problems, the existing traditional methods may sometimes be problem dependent, require repeated simulations, and fail to handle nonlinear behavior. In this regard, neural network model-based methods are more general, and their solutions are continuous over the given domain of integration, self-adaptive, and usable as a black box. As such, in this article, we have reviewed and analyzed different MI methods that are applied to investigate these problems.
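
A deliberately tiny stand-in for the neural ODE solvers reviewed: a one-hidden-layer tanh network embedded in a trial solution for y' = -y, y(0) = 1, trained by minimizing the squared ODE residual at collocation points (the network size, trial form, and optimizer are all assumptions):

```python
# Toy neural solution of y' = -y, y(0) = 1 on [0, 1] via the trial form
# y(t) = 1 + t*N(t), minimizing the squared ODE residual at collocation
# points. A tiny stand-in for the neural solvers the article reviews.
import numpy as np
from scipy.optimize import minimize

H = 8                                   # hidden units
t = np.linspace(0, 1, 25)               # collocation points

def unpack(theta):
    w1, b1 = theta[:H], theta[H:2*H]
    w2, b2 = theta[2*H:3*H], theta[3*H]
    return w1, b1, w2, b2

def residual_loss(theta):
    w1, b1, w2, b2 = unpack(theta)
    a = np.tanh(np.outer(t, w1) + b1)     # (25, H) hidden activations
    n = a @ w2 + b2                       # N(t)
    dn = ((1 - a**2) * w1) @ w2           # analytic N'(t)
    y = 1 + t * n                         # trial solution satisfies y(0)=1
    dy = n + t * dn
    return np.mean((dy + y) ** 2)         # residual of y' = -y

rng = np.random.default_rng(7)
fit = minimize(residual_loss, rng.normal(scale=0.5, size=3*H + 1), method="BFGS")

w1, b1, w2, b2 = unpack(fit.x)
y1 = 1 + 1.0 * (np.tanh(w1 * 1.0 + b1) @ w2 + b2)
print(f"network y(1) = {y1:.4f}  vs exact e^-1 = {np.exp(-1):.4f}")
```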

Journal ArticleDOI
TL;DR: An overview of the literature on privacy protection in smart meters with a particular focus on homomorphic encryption (HE) is presented, emphasizing the need to safeguard the privacy of smart‐meter users by identifying, describing, and comparing the main approaches that seek to address this problem.
Abstract: This article presents an overview of the literature on privacy protection in smart meters with a particular focus on homomorphic encryption (HE). Firstly, we introduce the concept of smart meters, the context in which they are inserted, and the main concerns and opposition inherent to their use. Later, an overview of privacy protection is presented, emphasizing the need to safeguard the privacy of smart-meter users by identifying, describing, and comparing the main approaches that seek to address this problem. Then, two privacy protection approaches based on HE are presented in more detail, and additionally we present two possible application scenarios. Finally, the article concludes with a brief overview of the unsolved challenges in HE and the most promising future research directions.
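
The additive property HE-based aggregation relies on can be shown with a textbook Paillier toy (insecure parameters, purely illustrative): the product of ciphertexts decrypts to the sum of the readings, so individual consumption values are never exposed:

```python
# Tiny textbook Paillier sketch (insecure toy parameters!) showing the
# additively homomorphic property used for smart-meter aggregation.
import math, random

p, q = 293, 433                      # toy primes; real keys are ~2048 bits
n, n2 = p * q, (p * q) ** 2
lam = math.lcm(p - 1, q - 1)
g = n + 1                            # standard simple choice of generator

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    L = lambda u: (u - 1) // n
    mu = pow(L(pow(g, lam, n2)), -1, n)
    return (L(pow(c, lam, n2)) * mu) % n

readings = [13, 7, 42]                          # three households' readings
aggregate = 1
for m in readings:
    aggregate = (aggregate * encrypt(m)) % n2   # homomorphic addition
print(decrypt(aggregate))                       # 62 == 13 + 7 + 42
```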

Journal ArticleDOI
TL;DR: In this article, the authors reviewed and investigated the studies on the diagnosis of sleep apnea using AI methods, including machine learning (ML) and deep learning (DL) methods.
Abstract: Apnea is a sleep disorder that stops or reduces airflow for a short time during sleep. Sleep apnea may last for a few seconds and happen many times during sleep. This reduction in breathing is associated with loud snoring, which may awaken the person with a feeling of suffocation. So far, a variety of methods have been introduced by researchers to diagnose sleep apnea, among which the polysomnography (PSG) method is known to be the best; however, analysis of PSG signals is very complicated. Many studies have been conducted on the automatic diagnosis of sleep apnea from biological signals using artificial intelligence (AI), including machine learning (ML) and deep learning (DL) methods. This research reviews and investigates the studies on the diagnosis of sleep apnea using AI methods. First, computer-aided diagnosis systems (CADS) for sleep apnea using ML and DL techniques are introduced, along with their parts, including dataset, preprocessing, and ML and DL methods. This research also summarizes, in a table, the important specifications of the studies on the diagnosis of sleep apnea using ML and DL methods. In the following, a comprehensive discussion is made of the studies carried out in this field. The challenges in the diagnosis of sleep apnea using AI methods are of paramount importance for researchers; accordingly, these obstacles are elaborately addressed. In another section, the most important future works for studies on sleep apnea detection from PSG signals and AI techniques are presented. Ultimately, the essential findings of this study are provided in the conclusion section. This article is categorized under: Technologies > Artificial Intelligence; Application Areas > Data Mining Software Tools; Algorithmic Development > Biological Data Mining
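
As a sketch of the front end of such a CADS (synthetic signal, assumed sampling rate and threshold; a crude rule stands in for the ML/DL classifiers reviewed):

```python
# Segment a synthetic "airflow" signal into 10-second epochs and flag
# epochs whose amplitude drops sharply -- a toy stand-in for a classifier.
import numpy as np

fs = 10                                          # Hz (assumed)
t = np.arange(0, 120, 1 / fs)                    # two minutes of signal
airflow = np.sin(2 * np.pi * 0.25 * t)           # ~15 breaths per minute
airflow[600:800] *= 0.1                          # simulated apnea: 60-80 s
airflow += 0.05 * np.random.default_rng(8).normal(size=t.size)

epoch = 10 * fs                                  # 10-second epochs
for i in range(0, len(airflow), epoch):
    rms = np.sqrt(np.mean(airflow[i : i + epoch] ** 2))
    flag = "APNEA?" if rms < 0.3 else "normal"
    print(f"{i // fs:3d}-{i // fs + 10:3d}s  rms={rms:.2f}  {flag}")
```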

Journal ArticleDOI
TL;DR: A comprehensive characterization of the TU problem, including a description of its subproblems, tasks, subtasks, and applications, is given in this article, where the common limitations in the existing problem statements and some directions for further research are also discussed.
Abstract: Tables are probably the most natural way to represent relational data in various media and formats. They store a large number of valuable facts that could be utilized for question answering, knowledge base population, natural language generation, and other applications. However, many tables are not accompanied by semantics for the automatic interpretation of the information they present. Table Understanding (TU) aims at recovering the missing semantics that enables the extraction of facts from tables. This problem covers a range of issues from table detection in document images to semantic table interpretation with the help of external knowledge bases. To date, TU research has been ongoing for 30 years. Nevertheless, there is no common point of view on the scope of TU; the terminology still needs agreement and unification. In recent years, science and technology have shown a rapidly increasing interest in TU. Nowadays, it is especially important to revisit the scope of this research problem. This article gives a comprehensive characterization of the TU problem, including a description of its subproblems, tasks, subtasks, and applications. It also discusses the common limitations used in the existing problem statements and proposes some directions for further research that would help overcome the corresponding limitations.
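
One of TU's subtasks, semantic column-type annotation, can be sketched against a miniature made-up knowledge base (real systems match against large KBs such as Wikidata):

```python
# Toy semantic table interpretation: annotate a column with the type whose
# instances (from a miniature, made-up KB) cover most of its cells.
kb = {
    "City":    {"paris", "berlin", "tokyo", "madrid"},
    "Country": {"france", "germany", "japan", "spain"},
}

column = ["Paris", "Berlin", "Tokio", "Madrid"]      # note the typo "Tokio"

def annotate(cells, kb):
    scores = {t: sum(c.lower() in inst for c in cells) / len(cells)
              for t, inst in kb.items()}
    return max(scores, key=scores.get), scores

best, scores = annotate(column, kb)
print(best, scores)   # "City" wins with 3/4 coverage despite the typo
```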

Journal ArticleDOI
TL;DR: A review of deep learning based image steganography techniques is presented in this article, where three key parameters (security, embedding capacity, and invisibility) for measuring the quality of a steganographic technique are described.
Abstract: A review of the deep learning based image steganography techniques is presented in this paper. For completeness, the recent traditional steganography techniques are also discussed briefly. The three key parameters (security, embedding capacity, and invisibility) for measuring the quality of an image steganographic technique are described. Various steganography techniques, with emphasis on the above three key parameters, are reviewed. The steganography techniques are classified here into three main categories: Traditional, Hybrid, and fully Deep Learning. The hybrid techniques are further divided into three sub‐categories: Cover Generation, Distortion Learning, and Adversarial Embedding. The fully Deep Learning techniques, based on the nature of the input, are further divided into three sub‐categories: GAN Embedding, Embedding Less, and Category Label. The main ideas of the important deep learning based steganography techniques are described. The strong and weak features of these techniques are outlined. The results reported by researchers on benchmark data sets CelebA, Bossbase, PASCAL‐VOC12, CIFAR‐100, ImageNet, and USC‐SIPI are used to evaluate the performance of various steganography techniques. Analysis of the results shows that there is scope for new suitable deep learning architectures that can improve the capacity and invisibility of image steganography.
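
For contrast with the learned embeddings the review focuses on, the simplest traditional technique, least-significant-bit (LSB) embedding, fits in a few lines (toy image, illustrative only):

```python
# Classic LSB embedding: hide message bits in the least significant bit
# of each pixel. Deep learning methods learn the embedding instead.
import numpy as np

rng = np.random.default_rng(9)
cover = rng.integers(0, 256, size=(4, 4), dtype=np.uint8)   # toy 4x4 image
message_bits = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=np.uint8)

stego = cover.copy().ravel()
stego[: len(message_bits)] = (stego[: len(message_bits)] & 0xFE) | message_bits
stego = stego.reshape(cover.shape)

recovered = stego.ravel()[: len(message_bits)] & 1
assert np.array_equal(recovered, message_bits)
print("max pixel change:", np.abs(stego.astype(int) - cover.astype(int)).max())  # 1
```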

Journal ArticleDOI
TL;DR: Two possible approaches to synthesizing datasets that reflect the patterns of real ones via a two-step process are explored: constraint-based generation and probabilistic generative modeling.
Abstract: The development of platforms and techniques for emerging Big Data and Machine Learning applications requires the availability of real-life datasets. A possible solution is to synthesize datasets that reflect patterns of real ones using a two-step approach: first, a real dataset X is analyzed to derive relevant patterns Z, and then such patterns are used to reconstruct a new dataset X′ that preserves the main characteristics of X. This survey explores two possible approaches: (1) constraint-based generation and (2) probabilistic generative modeling. The former is devised using inverse mining (IFM) techniques and consists of generating a dataset satisfying given support constraints on the itemsets of an input set, typically the frequent ones. By contrast, for the latter approach, recent developments in probabilistic generative modeling (PGM) are explored that model the generation as a sampling process from a parametric distribution, typically encoded as a neural network. The two approaches are compared by providing an overview of their instantiations for the case of discrete data and discussing their pros and cons.
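
A drastically simplified PGM-style sketch (an independence model that preserves only per-item supports, unlike the richer neural models discussed) illustrates the analyze-then-sample loop:

```python
# Simplified generative synthesis for transactional data: estimate
# per-item marginals from X and sample a synthetic X' item-by-item.
# Real generative models also capture correlations between items.
import numpy as np

rng = np.random.default_rng(10)
# X: 200 transactions over 6 items (binary "item present" matrix)
X = (rng.random((200, 6)) < [0.8, 0.6, 0.4, 0.3, 0.1, 0.05]).astype(int)

p_hat = X.mean(axis=0)                              # learned item probabilities
X_syn = (rng.random(X.shape) < p_hat).astype(int)   # sampled X'

print("real item supports:", np.round(X.mean(axis=0), 2))
print("syn. item supports:", np.round(X_syn.mean(axis=0), 2))
```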

Journal ArticleDOI
TL;DR: A thorough analysis of the scientific literature using data and text mining to uncover knowledge from online reviews due to their importance as user‐generated content finds that information management and technology, e‐commerce, and tourism stand out.
Abstract: This paper reports on a thorough analysis of the scientific literature using data and text mining to uncover knowledge from online reviews due to their importance as user‐generated content. In this context, more than 12,000 papers were extracted from publications indexed in the Scopus database within the last 15 years. Regarding the type of data, most previous studies focused on qualitative textual data to perform their analysis, with fewer looking for quantitative scores and/or characterizing reviewer profiles. In terms of application domains, information management and technology, e‐commerce, and tourism stand out. It is also clear that other areas of potentially valuable applications should be addressed in future research, such as arts and education, as well as more interdisciplinary approaches, namely in the spectrum of the social sciences.

Journal ArticleDOI
TL;DR: The various bio-inspired optimization techniques and their accuracy in image steganalysis pertaining to the discovery of embedded information in both JPEG and spatial domain steganalysis are analyzed.
Abstract: Image steganalysis involves the discovery of secret information embedded in an image. The common method is blind image steganalysis, which is a two-class classification problem. Blind steganalysis extracts all possible feature variations in an image due to embedding, selects the most appropriate feature data, and then classifies the image. The dimensionality of the extracted image features is high and demands data reduction to identify the most relevant features and to aid accurate classification of an image. The classification involves two classes, namely the clean (cover) image and the stego image (an image with embedded secret data). Since classification accuracy depends on the selection of the most appropriate features, opting for the best data reduction or data optimization algorithms becomes a prime requisite. Research shows that most statistical optimization techniques converge to local minima and lead to lower classification accuracy compared with bio-inspired methods. Bio-inspired optimization methods obtain improved classification accuracy by reducing the high-dimensional image features. These methods start with an initial population and then optimize it in steps until a global optimum is reached. Examples of such methods include Ant Lion Optimization (ALO) and the Firefly Algorithm (FFA); the literature shows around 54 such algorithms. Bio-inspired optimization has been applied in various fields of design optimization and is novel to image steganalysis. This article analyzes the various bio-inspired optimization techniques and their accuracy in image steganalysis pertaining to the discovery of embedded information in both JPEG and spatial domain steganalysis.
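
A generic population-based sketch in the spirit of these methods (a plain genetic algorithm with an assumed correlation-based fitness, not ALO or FFA specifically) for binary feature selection:

```python
# Generic population-based (genetic-algorithm) feature selection sketch:
# evolve binary masks that keep features correlated with the class. The
# fitness is a simple correlation score, not a full steganalysis model.
import numpy as np

rng = np.random.default_rng(11)
n, d, keep = 300, 40, 5
X = rng.normal(size=(n, d))
y = (X[:, :keep].sum(axis=1) > 0).astype(float)   # only first 5 features matter

def fitness(mask):
    if mask.sum() == 0:
        return -1.0
    corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in np.where(mask)[0]])
    return corr.mean() - 0.01 * mask.sum()        # reward relevance, punish size

pop = (rng.random((30, d)) < 0.2).astype(int)
for _ in range(40):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-10:]]                 # select the fittest
    children = []
    for _ in range(len(pop)):
        a, b = parents[rng.integers(10, size=2)]
        cut = rng.integers(1, d)
        child = np.concatenate([a[:cut], b[cut:]])          # crossover
        flip = rng.random(d) < 0.02                         # mutation
        children.append(np.where(flip, 1 - child, child))
    pop = np.array(children)

best = pop[np.argmax([fitness(m) for m in pop])]
print("selected features:", np.where(best == 1)[0])
```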

Journal ArticleDOI
TL;DR: The paper presents the most current review in OSINT, reflecting how the various state‐of‐the‐art tools and techniques can be applied in extracting terrorism‐related textual information from publicly accessible sources.
Abstract: In this contemporary era, where a large part of the world population is deluged by extensive use of the internet and social media, terrorists have found a potential opportunity to execute their vicious plans. They have a befitting medium to reach out to their targets to spread propaganda, disseminate training content, operate virtually, and further their goals. To restrain such activities, information over the internet in the context of terrorism needs to be analyzed and channeled into appropriate measures for combating terrorism. Open Source Intelligence (OSINT) offers a felicitous solution to this problem; it is an emerging discipline that leverages publicly accessible sources of information over the internet and effectively utilizes them to extract intelligence. The process of OSINT extraction is broadly observed to comprise three phases: (i) Data Acquisition, (ii) Data Enrichment, and (iii) Knowledge Inference. In the context of terrorism, researchers have made notable contributions across these three phases. However, a comprehensive review that delineates these research contributions into an integrated workflow of intelligence extraction has not been found. This paper presents the most current review in OSINT, reflecting how the various state-of-the-art tools and techniques can be applied to extracting terrorism-related textual information from publicly accessible sources. Various data mining and text analysis-based techniques, that is, natural language processing, machine learning, and deep learning, have been reviewed to extract and evaluate textual data. Additionally, towards the end of the paper, we discuss challenges and gaps observed in different phases of OSINT extraction.
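
A skeleton of the three phases on an inline toy document (the sources, lexicon, and scoring rule are illustrative placeholders; a real system would crawl public sources and use the NLP/ML models reviewed):

```python
# Toy three-phase OSINT skeleton with placeholder data and rules.
from collections import Counter

def acquire():                               # (i) data acquisition
    return ["public post: group plans recruitment drive and propaganda push",
            "news item: local festival attracts record crowds this weekend"]

def enrich(doc):                             # (ii) data enrichment
    tokens = [w.strip(".,:").lower() for w in doc.split()]
    return Counter(tokens)

LEXICON = {"recruitment", "propaganda", "attack", "extremist"}  # assumed terms

def infer(features):                         # (iii) knowledge inference
    score = sum(features[t] for t in LEXICON)
    return "flag for analyst review" if score >= 2 else "ignore"

for doc in acquire():
    print(infer(enrich(doc)), "<-", doc[:40], "...")
```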



Journal ArticleDOI
TL;DR: In this paper , the authors provide a comprehensive review of some of the most important research areas related to the online infosphere, focusing on the technical challenges and potential solutions, and outline some important future directions.
Abstract: The evolution of Artificial Intelligence (AI)-based systems and applications has pervaded everyday life to make decisions that have a momentous impact on individuals and society. With the staggering growth of online data, often termed the online infosphere, it has become paramount to monitor the infosphere to ensure social good, as AI-based decisions depend heavily on it. This survey aims to provide a comprehensive review of some of the most important research areas related to the infosphere, focusing on the technical challenges and potential solutions. The survey also outlines some of the important future directions. We begin by focusing on the collaborative systems that have emerged within the infosphere, with a special thrust on Wikipedia. In the follow-up, we demonstrate how the infosphere has been instrumental in the growth of scientific citations and collaborations, thus fuelling interdisciplinary research. Finally, we illustrate the issues related to the governance of the infosphere, such as the tackling of (a) rising hateful and abusive behavior and (b) bias and discrimination in different online platforms and news reporting. This article is categorized under: Application Areas > Internet



Journal ArticleDOI
TL;DR: This work discusses the problem from a geometrical perspective and provides a framework which exploits the metric structure of a data set and suggests a novel, mathematically precise and widely applicable distinction between distributional and structural outliers based on the geometry and topology of the data manifold.
Abstract: Outlier or anomaly detection is an important task in data analysis. We discuss the problem from a geometrical perspective and provide a framework which exploits the metric structure of a data set. Our approach rests on the manifold assumption, that is, that the observed, nominally high‐dimensional data lie on a much lower dimensional manifold and that this intrinsic structure can be inferred with manifold learning methods. We show that exploiting this structure significantly improves the detection of outlying observations in high dimensional data. We also suggest a novel, mathematically precise and widely applicable distinction between distributional and structural outliers based on the geometry and topology of the data manifold that clarifies conceptual ambiguities prevalent throughout the literature. Our experiments focus on functional data as one class of structured high‐dimensional data, but the framework we propose is completely general and we include image and graph data applications. Our results show that the outlier structure of high‐dimensional and non‐tabular data can be detected and visualized using manifold learning methods and quantified using standard outlier scoring methods applied to the manifold embedding vectors.
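
The pipeline the abstract describes can be sketched with off-the-shelf components (the method choices and synthetic data are assumptions): a manifold embedding followed by a standard outlier scorer applied to the embedding vectors:

```python
# Manifold embedding (Isomap) + standard outlier scoring (LOF) on the
# embedding vectors; synthetic noisy circle in 10D plus off-manifold points.
import numpy as np
from sklearn.manifold import Isomap
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(12)
theta = rng.uniform(0, 2 * np.pi, size=300)
circle = np.column_stack([np.cos(theta), np.sin(theta)])
X = np.hstack([circle, 0.01 * rng.normal(size=(300, 8))])   # embed in 10D
X[:5] += rng.normal(scale=2.0, size=(5, 10))                # structural outliers

emb = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
lof = LocalOutlierFactor(n_neighbors=20)
lof.fit_predict(emb)
scores = -lof.negative_outlier_factor_        # larger = more outlying

print("most outlying indices:", np.argsort(scores)[-5:])  # ideally 0..4
```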


Journal ArticleDOI
TL;DR: A bird's eye view of the applications of ML in postgenomic biology is provided, and an attempt is made to indicate, as far as possible, the areas of research that are poised to make further impacts, including the importance of explainable artificial intelligence in human health.
Abstract: In recent years, machine learning (ML) has been revolutionizing biology, biomedical sciences, and gene-based agricultural technology capabilities. Massive data generated in biological sciences by rapid and deep gene sequencing and protein or other molecular structure determination, on the one hand, require data analysis capabilities using ML that are distinctly different from classical statistical methods; on the other, these large datasets are enabling the adoption of novel data-intensive ML algorithms for the solution of biological problems that until recently had relied on mechanistic model-based approaches that are computationally expensive. This review provides a bird's eye view of the applications of ML in postgenomic biology. An attempt is also made to indicate, as far as possible, the areas of research that are poised to make further impacts, including the importance of explainable artificial intelligence in human health. Further contributions of ML are expected to transform medicine, public health, and agricultural technology, as well as to provide invaluable gene-based guidance for the management of complex environments in this age of global warming.
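
As one small, concrete example of the data-representation step behind many of these sequence-based applications, k-mer frequency vectors turn raw sequences into fixed-length features a downstream classifier can consume:

```python
# Turn a DNA sequence into a k-mer frequency vector (a common ML feature
# representation for sequence data in postgenomic biology).
from collections import Counter
from itertools import product

def kmer_vector(seq, k=3):
    counts = Counter(seq[i : i + k] for i in range(len(seq) - k + 1))
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    total = max(sum(counts.values()), 1)
    return [counts[w] / total for w in vocab]    # 64-dim frequency vector

v = kmer_vector("ATGCGATACGATTACA")
print(len(v), round(sum(v), 2))                  # 64 1.0
```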