Showing papers in "Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery in 2017"


Journal ArticleDOI
TL;DR: An overview of the different ways in which randomization can be applied to the design of neural networks and kernel functions is provided to clarify innovative lines of research, open problems, and foster the exchanges of well‐known results throughout different communities.
Abstract: Neural networks, as powerful tools for data mining and knowledge engineering, can learn from data to build feature-based classifiers and nonlinear predictive models. Training neural networks involves the optimization of nonconvex objective functions, and usually, the learning process is costly and infeasible for applications associated with data streams. A possible, albeit counterintuitive, alternative is to randomly assign a subset of the networks' weights so that the resulting optimization task can be formulated as a linear least-squares problem. This methodology can be applied to both feedforward and recurrent networks, and similar techniques can be used to approximate kernel functions. Many experimental results indicate that such randomized models can reach sound performance compared to fully adaptable ones, with a number of favorable benefits, including (1) simplicity of implementation, (2) faster learning with less intervention from human beings, and (3) possibility of leveraging overall linear regression and classification algorithms (e.g., ℓ1 norm minimization for obtaining sparse formulations). This class of neural networks is attractive and valuable to the data mining community, particularly for handling large scale data mining in real-time. However, the literature in the field is extremely vast and fragmented, with many results being reintroduced multiple times under different names. This overview aims to provide a self-contained, uniform introduction to the different ways in which randomization can be applied to the design of neural networks and kernel functions. A clear exposition of the basic framework underlying all these approaches helps to clarify innovative lines of research, open problems, and most importantly, foster the exchanges of well-known results throughout different communities. WIREs Data Mining Knowl Discov 2017, 7:e1200. doi: 10.1002/widm.1200

256 citations
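
The idea in the abstract above (fix a random subset of weights, then solve a linear least-squares problem for the rest) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the article: the single hidden layer, the `tanh` activation, the uniform sampling range, the small ridge term, and the hypothetical helper names `solve_linear_system` and `fit_random_network`.

```python
import math
import random

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_random_network(xs, ys, n_hidden=20, ridge=1e-6, seed=0):
    """Fit a one-input network whose hidden layer is random and fixed."""
    rng = random.Random(seed)
    # Hidden weights and biases are drawn once and never trained (assumed range).
    weights = [rng.uniform(-2.0, 2.0) for _ in range(n_hidden)]
    biases = [rng.uniform(-2.0, 2.0) for _ in range(n_hidden)]

    def hidden(x):
        # Random nonlinear features plus a constant term for the output bias.
        return [math.tanh(w * x + b) for w, b in zip(weights, biases)] + [1.0]

    h = [hidden(x) for x in xs]
    k = n_hidden + 1
    # Only the output weights are learned, via ridge-regularized normal equations:
    # (H^T H + ridge * I) beta = H^T y
    gram = [[sum(row[i] * row[j] for row in h) + (ridge if i == j else 0.0)
             for j in range(k)] for i in range(k)]
    rhs = [sum(row[i] * y for row, y in zip(h, ys)) for i in range(k)]
    beta = solve_linear_system(gram, rhs)
    return lambda x: sum(bi * hi for bi, hi in zip(beta, hidden(x)))

# Toy regression target, chosen only for illustration.
xs = [i / 20.0 for i in range(-20, 21)]
ys = [math.sin(3.0 * x) for x in xs]
model = fit_random_network(xs, ys)
mse = sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
```

Because the hidden layer is never updated, training reduces to one linear solve, which is the source of the speed and simplicity benefits the abstract lists.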


Journal ArticleDOI
TL;DR: An up‐to‐date survey of itemset mining problems is provided, and the relationship to other popular pattern mining problems, such as sequential pattern mining, episode mining, subgraph mining, and association rule mining, is discussed.
Abstract: Itemset mining is an important subfield of data mining, which consists of discovering interesting and useful patterns in transaction databases. The traditional task of frequent itemset mining is to discover groups of items (itemsets) that appear frequently together in transactions made by customers. Although itemset mining was designed for market basket analysis, it can be viewed more generally as the task of discovering groups of attribute values frequently co-occurring in databases. Because of its numerous applications in domains such as bioinformatics, text mining, product recommendation, e-learning, and web click stream analysis, itemset mining has become a popular research area. This study provides an up-to-date survey that can serve both as an introduction and as a guide to recent advances and opportunities in the field. The problem of frequent itemset mining and its applications are described. Moreover, the main approaches and strategies for solving itemset mining problems are presented, along with their characteristics. Limitations of traditional frequent itemset mining approaches are also highlighted, and extensions of the task of itemset mining are presented, such as high-utility itemset mining, rare itemset mining, fuzzy itemset mining, and uncertain itemset mining. This study also discusses research opportunities and the relationship to other popular pattern mining problems, such as sequential pattern mining, episode mining, subgraph mining, and association rule mining. The main open-source libraries of itemset mining implementations are also briefly presented. WIREs Data Mining Knowl Discov 2017, 7:e1207. doi: 10.1002/widm.1207

197 citations
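
The frequent itemset mining task described above can be sketched with a small level-wise (Apriori-style) search. This is only one of many possible strategies and is not claimed to be the survey's own formulation; the function name `frequent_itemsets`, the absolute `min_support` count, and the toy baskets are assumptions for illustration.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for itemsets found in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(current)
    k = 2
    while current:
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets and
        # keep only those whose (k-1)-subsets are all frequent (downward closure).
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in current for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        # Support counting: one pass over the transactions per level.
        current = {}
        for cand in candidates:
            count = sum(1 for t in transactions if cand <= t)
            if count >= min_support:
                current[cand] = count
        result.update(current)
        k += 1
    return result

# Hypothetical market-basket data.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
found = frequent_itemsets(baskets, min_support=3)
```

With a support threshold of 3, the pair `{"diapers", "beer"}` is kept, while `{"bread", "milk", "diapers"}` is generated as a candidate but dropped because it appears in only two baskets.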


Journal ArticleDOI
TL;DR: Three of the main tasks facing this issue concern: (1) the detection of opinion spam in review sites, (2) the detection of fake news and spam in microblogging, and (3) the credibility assessment of online health information.
Abstract: In the Social Web scenario, where large amounts of User Generated Content diffuse through Social Media, the risk of running into misinformation is not negligible. For this reason, assessing and mining the credibility of both sources of information and information itself constitute nowadays a fundamental issue. Credibility, also referred to as believability, is a quality perceived by individuals, who are not always able to discern with their cognitive capacities genuine information from the fake one. For this reason, in the recent years several approaches have been proposed to automatically assess credibility in Social Media. Most of them are based on data-driven models, i.e., they employ machine-learning techniques to identify misinformation, but recently also model-driven approaches are emerging, as well as graph-based approaches focusing on credibility propagation. Since multiple social applications have been developed for different aims and in different contexts, several solutions have been considered to address the issue of credibility assessment in Social Media. Three of the main tasks facing this issue and considered in this article concern: (1) the detection of opinion spam in review sites, (2) the detection of fake news and spam in microblogging, and (3) the credibility assessment of online health information. Despite the high number of interesting solutions proposed in the literature to tackle the above three tasks, some issues remain unsolved; they mainly concern both the absence of predefined benchmarks and gold standard datasets, and the difficulty of collecting and mining large amount of data, which has not yet received the attention it deserves. For further resources related to this article, please visit the WIREs website.

159 citations


Journal ArticleDOI
TL;DR: This review helps interested readers to learn about enterprise data leak threats, recent data leak incidents, various state-of-the‐art prevention and detection techniques, new challenges, and promising solutions and exciting opportunities.
Abstract: A data breach is the intentional or inadvertent exposure of confidential information to unauthorized parties. In the digital era, data has become one of the most critical components of an enterprise. Data leakage poses serious threats to organizations, including significant reputational damage and financial losses. As the volume of data is growing exponentially and data breaches are happening more frequently than ever before, detecting and preventing data loss has become one of the most pressing security concerns for enterprises. Despite a plethora of research efforts on safeguarding sensitive information from being leaked, it remains an active research problem. This review helps interested readers to learn about enterprise data leak threats, recent data leak incidents, various state‐of‐the‐art prevention and detection techniques, new challenges, and promising solutions and exciting opportunities. WIREs Data Mining Knowl Discov 2017, 7:e1211. doi: 10.1002/widm.1211

148 citations


Journal ArticleDOI
TL;DR: In this study, a survey of state‐of‐the‐art DDM techniques is provided, including distributed frequent itemset mining, distributed frequent sequence mining, distributed frequent graph mining, distributed clustering, and privacy preserving of distributed data mining.
Abstract: Due to the rapid growth of resource sharing, distributed systems are developed, which can be used to utilize the computations. Data mining (DM) provides powerful techniques for finding meaningful and useful information from a very large amount of data, and has a wide range of real‐world applications. However, traditional DM algorithms assume that the data is centrally collected, memory‐resident, and static. It is challenging to manage the large‐scale data and process them with very limited resources. For example, large amounts of data are quickly produced and stored at multiple locations. It becomes increasingly expensive to centralize them in a single place. Moreover, traditional DM algorithms generally have some problems and challenges, such as memory limits, low processing ability, and inadequate hard disk, and so on. To solve the above problems, DM on distributed computing environment [also called distributed data mining (DDM)] has been emerging as a valuable alternative in many applications. In this study, a survey of state‐of‐the‐art DDM techniques is provided, including distributed frequent itemset mining, distributed frequent sequence mining, distributed frequent graph mining, distributed clustering, and privacy preserving of distributed data mining. We finally summarize the opportunities of data mining tasks in distributed environment. WIREs Data Mining Knowl Discov 2017, 7:e1216. doi: 10.1002/widm.1216

125 citations


Journal ArticleDOI
TL;DR: This paper aims to provide the reader with a complete and comprehensive review of the existing literature that helps us understand the application of EDS in MOOCs.
Abstract: The current massive open online course (MOOC) euphoria is revolutionizing online education. Despite its expediency, there is considerable skepticism over various concerns. In order to resolve some of these problems, educational data science (EDS) has been used with success. MOOCs provide a wealth of information about the way in which a large number of learners interact with educational platforms and engage with the courses offered. This extensive amount of data provided by MOOCs concerning students' usage information is a gold mine for EDS. This paper aims to provide the reader with a complete and comprehensive review of the existing literature that helps us understand the application of EDS in MOOCs. The main works in this area are described and grouped by task or issue to be solved, along with the techniques used. WIREs Data Mining Knowl Discov 2017, 7:e1187. doi: 10.1002/widm.1187 This article is categorized under: Application Areas > Education and Learning

111 citations


Journal ArticleDOI
TL;DR: This contribution provides a review of the existing publications in time series motif discovery along with advantages and disadvantages of existing approaches and serves as a glossary for researchers in this field.
Abstract: Recent decades have witnessed a huge growth in medical applications, genetic analysis, and in the performance of manufacturing technologies and automatised production systems. A challenging task is to identify and diagnose the behavior of such systems, which aim to produce a product with desired quality. In order to control the state of the systems, various information is gathered from different types of sensors (optical, acoustic, chemical, electric, and thermal). Time series data are a set of real-valued variables obtained chronologically. Data mining and machine learning help derive meaningful knowledge from time series. Such tasks include clustering, classification, anomaly detection and motif discovery. Motif discovery attempts to find meaningful, new, and unknown knowledge from data. Detection of motifs in a time series is beneficial for, e.g., discovery of rules or specific events in a signal. Motifs provide useful information for the user in order to model or analyze the data. Motif discovery is applied to various areas such as telecommunication, medicine, web, motion-capture, and sensor networks. This contribution provides a review of the existing publications in time series motif discovery along with advantages and disadvantages of existing approaches. Moreover, the research issues and missing points in this field are highlighted. The main objective of this focus article is to serve as a glossary for researchers in this field.

85 citations


Journal ArticleDOI
TL;DR: Tensor-based recommender models push the boundaries of traditional collaborative filtering techniques by taking into account the multifaceted nature of real environments, which allows them to produce more accurate, situational (e.g. context-aware, criteria-driven) recommendations.
Abstract: Substantial progress in the development of new and efficient tensor factorization techniques has led to extensive research into their applicability in the recommender systems field. Tensor-based recommender models push the boundaries of traditional collaborative filtering techniques by taking into account the multifaceted nature of real environments, which allows them to produce more accurate, situational (e.g. context-aware, criteria-driven) recommendations. Despite the promising results, tensor-based methods are poorly covered in existing recommender systems surveys. This survey aims to complement previous works and provide a comprehensive overview on the subject. To the best of our knowledge, this is the first attempt to consolidate studies from various application domains in an easily readable, digestible format, which helps to get a notion of the current state of the field. We also provide a high level discussion of the future perspectives and directions for further improvement of tensor-based recommendation systems.

66 citations


Journal ArticleDOI
TL;DR: A general guideline is provided for finding an appropriate tensor representation, suitable tensor model, and interpretation, which are application dependent choices, and these choices are illustrated through successful applications in epilepsy.
Abstract: Electroencephalography (EEG) and functional magnetic resonance imaging (fMRI) record a mixture of ongoing neural processes, physiological and nonphysiological noise. The pattern of interest, such as epileptic activity, is often hidden within this noisy mixture. Therefore, blind source separation (BSS) techniques, which can retrieve the activity pattern of each underlying source, are very useful. Tensor decomposition techniques are very well suited to solve the BSS problem, as they provide a unique solution under mild constraints. Uniqueness is crucial for an unambiguous interpretation of the components, matching them to true neural processes and characterizing them using the component signatures. Moreover, tensors provide a natural representation of the inherently multidimensional EEG and fMRI, and preserve the structural information defined by the interdependencies among the various modes such as channels, time, patients, etc. Despite the well-developed theoretical framework, tensor-based analysis of real, large-scale clinical datasets is still scarce. Indeed, the application of tensor methods is not straightforward. Finding an appropriate tensor representation, suitable tensor model, and interpretation are application dependent choices, which require expertise both in neuroscience and in multilinear algebra. The aim of this paper is to provide a general guideline for these choices and illustrate them through successful applications in epilepsy. WIREs Data Mining Knowl Discov 2017, 7:e1197. doi: 10.1002/widm.1197

61 citations


Journal ArticleDOI
TL;DR: This paper examines the potentials of big data analytics for security intelligence under a criminal analytics framework, and examines some common data sources, analytics methods, and applications related to two important aspects of social network analysis namely, structural analysis and positional analysis that lay the foundation of criminal analytics.
Abstract: Applications of various data analytics technologies to security and criminal investigation during the past three decades have demonstrated the inception, growth, and maturation of criminal analytics. We first identify five cutting-edge data mining technologies such as link analysis, intelligent agents, text mining, neural networks, and machine learning. Then, we explore their recent applications to the criminal analytics domain, and discuss the challenges arising from these innovative applications. We also extend our study to big data analytics which provides some state-of-the-art technologies to reshape criminal investigations. In this paper, we review the recent literature, and examine the potentials of big data analytics for security intelligence under a criminal analytics framework. We examine some common data sources, analytics methods, and applications related to two important aspects of social network analysis namely, structural analysis and positional analysis that lay the foundation of criminal analytics. Another contribution of this paper is that we also advocate a novel criminal analytics methodology that is underpinned by big data analytics. We discuss the merits and challenges of applying big data analytics to the criminal analytics domain. Finally, we highlight the future research directions of big data analytics enhanced criminal investigations. WIREs Data Mining Knowl Discov 2017, 7:e1208. doi: 10.1002/widm.1208 This article is categorized under: Fundamental Concepts of Data and Knowledge > Data Concepts Fundamental Concepts of Data and Knowledge > Key Design Issues in Data Mining Technologies > Computer Architectures for Data Mining

54 citations


Journal ArticleDOI
TL;DR: This study follows the guidelines of systematic literature review and applies them to the field of Web crawling, calling for an increased awareness in various fields of the Web crawler and identifying how techniques from other domains can be used for crawling the Web.
Abstract: Performance of any search engine relies heavily on its Web crawler. Web crawlers are the programs that get webpages from the Web by following hyperlinks. These webpages are indexed by a search engine and can be retrieved by a user query. In the area of Web crawling, we still lack an exhaustive study that covers all crawling techniques. This study follows the guidelines of systematic literature review and applies it to the field of Web crawling. We used the standard procedure of carrying out a systematic literature review on 248 studies from a total of 1488 articles published in 12 leading journals and other premier conferences and workshops. Existing literature about the Web crawler is classified into different key subareas. Each subarea is further divided according to the techniques being used. We analyzed the distribution of various articles using multiple criteria and depicted conclusions. Various studies that use open source Web crawlers are also reported. We have highlighted future areas of research. We call for an increased awareness in various fields of the Web crawler and identify how techniques from other domains can be used for crawling the Web. Limitations and recommendations for future are also discussed. WIREs Data Mining Knowl Discov 2017, 7:e1218. doi: 10.1002/widm.1218

Journal ArticleDOI
TL;DR: Evaluating 19 open source data mining tools reveals that RapidMiner, Konstanz Information Miner, and Waikato Environment for Knowledge Analysis are the tools that include a higher percentage of these features.
Abstract: The growing interest in the extraction of useful knowledge from data with the aim of being beneficial for the data owner is giving rise to multiple data mining tools. The research community is especially aware of the importance of open source data mining software to ensure and ease the dissemination of novel data mining algorithms. The availability of these tools at no cost, and also the chance of better understanding of the approaches by examining their source code, provides the research community with an opportunity to tune and improve the algorithms. Documentation, updating, variety of algorithms, extensibility, and interoperability, among others, can be major issues that motivate users to opt for a specific open source data mining tool. The aim of this paper is to evaluate 19 open source data mining tools and to provide the research community with an extensive study based on a wide set of features that any tool should satisfy. The evaluation is carried out by following two methodologies. The first one is based on scores provided by experts to produce a subjective judgment of each tool. The second procedure performs an objective analysis about which features are satisfied by each tool. The ultimate aim of this work is to provide the research community with an extensive study on different features included in any data mining tool, from both a subjective and an objective point of view. Results reveal that RapidMiner, Konstanz Information Miner, and Waikato Environment for Knowledge Analysis are the tools that include a higher percentage of these features. For further resources related to this article, please visit the WIREs website.

Journal ArticleDOI
TL;DR: The results coming from the reviewed works indicate the promising capability of SDAs to perform sentiment recognition on a multitude of domains and languages.
Abstract: Deep learning has been shown to outperform numerous conventional machine learning algorithms (e.g., support vector machines) in many fields, such as image processing and text analyses. This is due to its outstanding capability to model complex data distributions. However, as networks become deeper, there is an increased risk of overfitting and higher sensitivity to noise. Stacked denoising autoencoders (SDAs) provide an infrastructure to resolve these issues. In the field of sentiment recognition from textual contents, SDAs have been widely used (especially for domain adaptation), and have been consistently refined and improved through defining new alternate topologies as well as different learning algorithms. A wide selection of these approaches are reviewed and compared in this article. The results coming from the reviewed works indicate the promising capability of SDAs to perform sentiment recognition on a multitude of domains and languages. For further resources related to this article, please visit the WIREs website.

Journal ArticleDOI
TL;DR: Theoretical and practical aspects of MM algorithm design are discussed, specific algorithms for three example applications are derived, and numerical demonstrations are presented.
Abstract: MM (majorization–minimization) algorithms are an increasingly popular tool for solving optimization problems in machine learning and statistical estimation. This article introduces the MM algorithm framework in general and via three commonly considered example applications: Gaussian mixture regressions, multinomial logistic regressions, and support vector machines. Specific algorithms for these three examples are derived and numerical demonstrations are presented. Theoretical and practical aspects of MM algorithm design are discussed. WIREs Data Mining Knowl Discov 2017, 7:e1198. doi: 10.1002/widm.1198
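
The majorization–minimization principle named above can be illustrated on a deliberately small case: binary logistic regression with a single coefficient and no intercept. This is a sketch under stated assumptions, not the article's derivation. It uses the standard fact that the second derivative of the logistic loss is at most 1/4, so a quadratic with curvature `0.25 * sum(x * x)` lies above the objective and touches it at the current iterate; minimizing that quadratic gives the update. The function names and the toy data are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neg_log_likelihood(beta, xs, ys):
    """Logistic negative log-likelihood for labels in {0, 1}."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = beta * x
        # Numerically stable log(1 + exp(z)).
        softplus = max(z, 0.0) + math.log1p(math.exp(-abs(z)))
        total += softplus - y * z
    return total

def gradient(beta, xs, ys):
    return sum(x * (sigmoid(beta * x) - y) for x, y in zip(xs, ys))

def mm_logistic(xs, ys, n_iter=200):
    """Minimize the objective by repeatedly minimizing a quadratic majorizer."""
    # Upper bound on the curvature of the objective, valid for every beta.
    curvature = 0.25 * sum(x * x for x in xs)
    beta = 0.0
    losses = [neg_log_likelihood(beta, xs, ys)]
    for _ in range(n_iter):
        # Minimizer of the quadratic surrogate built at the current beta.
        beta -= gradient(beta, xs, ys) / curvature
        losses.append(neg_log_likelihood(beta, xs, ys))
    return beta, losses

# Toy, non-separable data chosen only for illustration.
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
beta_hat, losses = mm_logistic(xs, ys)
```

Because each surrogate sits above the objective and agrees with it at the current iterate, the recorded `losses` cannot increase from one iteration to the next, which is the defining descent property of MM schemes.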

Journal ArticleDOI
TL;DR: This study aims to discover communities in multi‐relational networks through relational learning, using non‐negative tensor factorization and GA k‐means clustering for community discovery.
Abstract: The ubiquity of social networking sites leads to the generation of rich social media content. Community discovery is one of the significant tools in the analysis of social media data that is often multi-relational due to diverse forms of user interactions. Although there has been extensive research devoted to community discovery, most of it is restricted to single-relational networks. However, focus has shifted to multi-relational networks in recent years. In this study, we aim to discover communities in multi-relational networks through relational learning. Our main focus is the utilization of non-negative tensor factorization and GA k-means clustering for community discovery. In order to incorporate the relational characteristics of the data in the learning methodology, tensors are used to model the multi-relational network. Tensor factorization reveals the latent features of the data and shows state-of-the-art results for multi-relational learning. Once the implicit information is obtained by factorization, we apply a GA k-means clustering algorithm for community discovery. Experiments are performed on synthetic as well as real datasets. The results obtained are quite promising and clearly demonstrate the effectiveness of our proposed scheme. WIREs Data Mining Knowl Discov 2017, 7:e1196. doi: 10.1002/widm.1196 This article is categorized under: Algorithmic Development > Structure Discovery Technologies > Machine Learning

Journal ArticleDOI
TL;DR: It is suggested that a conceptual scientific workflow‐based programming framework associated with an elastic cloud computing environment running big data tools (such as Hadoop and Spark) is a good choice for facilitating effective data mining and collaboration among scientists.
Abstract: With the development of applications and high-throughput sensor technologies in medical fields, scientists and scientific professionals are facing a big challenge—how to manage and analyze the big electrophysiological datasets created by these sensor technologies. The challenge exhibits several aspects: one is the size of the data (which is usually more than terabytes); the second is the format used to store the data (the data created are generally stored using different formats); the third is that most of these unstructured, semi-structured, or structured datasets are still distributed over many researchers' own local computers in their laboratories, which are not open access, to become isolated data islands. Thus, how to overcome the challenge and share/mine the scientific data has become an important research topic. The aim of this paper is to systematically review recent published research on the developed web-based electrophysiological data platforms from the perspective of cloud computing and programming frameworks. Based on this review, we suggest that a conceptual scientific workflow-based programming framework associated with an elastic cloud computing environment running big data tools (such as Hadoop and Spark) is a good choice for facilitating effective data mining and collaboration among scientists. WIREs Data Mining Knowl Discov 2017, 7:e1206. doi: 10.1002/widm.1206 For further resources related to this article, please visit the WIREs website.

Journal ArticleDOI
TL;DR: This survey focuses on the use of segmentation methods for extracting behavioral information from individual mobility data, in particular from spatial trajectories, e.g., semantic trajectories and symbolic trajectories.
Abstract: Segmentation techniques partition a sequence of data points into a series of disjoint subsequences—segments—based on some criteria. Depending on the context and the nature of data themselves, segments return an approximate representation. The final result is a summarized representation of the sequence. This intuitive mechanism has been extensively studied, for example, for the summarization of time series in order to preserve the ‘shape’ of the sequence while omitting irrelevant details. This survey focuses on the use of segmentation methods for extracting behavioral information from individual mobility data, in particular from spatial trajectories. Such information can then be given a compact representation in the form of summarized trajectories, e.g., semantic trajectories and symbolic trajectories. Two major streams of research are discussed, spanning computational geometry and data mining respectively, that are emblematic of the multiplicity of views. WIREs Data Mining Knowl Discov 2017, 7:e1214. doi: 10.1002/widm.1214
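
The basic mechanism described above, partitioning a sequence into disjoint segments according to a criterion, can be sketched with a greedy pass. The criterion used here (the spread of values inside a segment must stay within a tolerance) and the speed data are assumptions chosen for illustration; the survey covers many other criteria and algorithms.

```python
def segment(values, tolerance):
    """Split values into disjoint (start, end) index ranges, end exclusive.

    A segment is extended while max - min of its values stays within tolerance.
    """
    if not values:
        return []
    segments = []
    start = 0
    lo = hi = values[0]
    for i in range(1, len(values)):
        new_lo, new_hi = min(lo, values[i]), max(hi, values[i])
        if new_hi - new_lo > tolerance:
            # Criterion violated: close the current segment and open a new one.
            segments.append((start, i))
            start = i
            lo = hi = values[i]
        else:
            lo, hi = new_lo, new_hi
    segments.append((start, len(values)))
    return segments

# Hypothetical speeds along a trajectory: slow, fast, slow.
speeds = [0.1, 0.2, 0.1, 5.0, 5.2, 4.9, 0.3, 0.2]
parts = segment(speeds, tolerance=1.0)
```

On this toy input the pass yields three index ranges, which could then be labeled (for example as stop, move, stop) to form a summarized representation of the sequence.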

Journal ArticleDOI
TL;DR: This article categorizes the main aspects affecting human aging into a taxonomy for assisting data mining (DM) research on this topic, and analyzes the comprehensiveness of the main databases of longitudinal studies of human aging worldwide.
Abstract: Human aging is a global problem that will have a large socioeconomic impact. A better understanding of aging can direct public policies that minimize its negative effects in the future. Over many years, several longitudinal studies of human aging have been conducted aiming to comprehend the phenomenon, and various factors influencing human aging are under analysis. In this review, we categorize the main aspects affecting human aging into a taxonomy for assisting data mining (DM) research on this topic. We also present tables summarizing the main characteristics of 64 research articles using data from aging-related longitudinal studies, in terms of the aging-related aspects analyzed, the main data analysis techniques used, and the specific longitudinal database mined in each article. Finally, we analyze the comprehensiveness of the main databases of longitudinal studies of human aging worldwide, regarding which proportion of the proposed taxonomy's aspects are covered by each longitudinal database. We observed that most articles analyzing such data use classical (parametric, linear) statistical techniques, with little use of more modern (nonparametric, nonlinear) DM methods for analyzing longitudinal databases of human aging. We hope that this article will contribute to DM research in two ways: first, by drawing attention to the important problem of global aging and the free availability of several longitudinal databases of human aging; second, by providing useful information to make research design choices about mining such data, e.g., which longitudinal study and which types of aging-related aspects should be analyzed, depending on the research's goals.

Journal ArticleDOI
TL;DR: The proposed algorithm is based mainly on similar members’ actions rather than on structural similarity alone, with the aim of detecting communities that are closely mapped to the underlying behavioral communities in real social networks.
Abstract: Community detection has become a crucial task in social network mining. Detecting communities summarizes interactions between members for gaining deep understanding of interesting characteristics shared between members of the same community. In this research, we propose a novel community detection algorithm for the purpose of revealing and analyzing hidden similar behavior of online users. The proposed algorithm is based mainly on similar members’ actions rather than on structural similarity alone, with the aim of detecting communities that are closely mapped to the underlying behavioral communities in real social networks. First, leaders of the social network are discovered; then, communities are detected based on those leaders. The idea is grounded on the assumption that communities could be formed around people with great influence. Extensive experiments and analysis show the ability of the proposed algorithm to successfully detect real-world communities with improved accuracy. For further resources related to this article, please visit the WIREs website.

Journal ArticleDOI
TL;DR: This study surveys methods for modeling and analyzing online user behavior, focuses on negative behaviors (social spamming and cyberbullying) and mitigation techniques for these behaviors, and provides information on the interplay between privacy and deception in social networks.
Abstract: Mathematical modeling of social network interactions requires several variables all interacting over a graph structure.

Journal ArticleDOI
TL;DR: This work decomposes the problem into subproblems and develops various neural network architectures, all of which are purely data‐driven and capable of learning continuous representations of the mention and the entity from data.
Abstract: Entity disambiguation is a fundamental task in natural language processing and computational linguistics. Given a query consisting of a mention name string and a background document, entity disambi...


Journal ArticleDOI
TL;DR: This study surveys recent advances in approaches to model temporal information with a focus on the temporal perspective, and discusses advantages and challenges of each approach.
Abstract: Deceptive engagement in social media, such as spamming, commenting, or rating with automatic scripts, or spreading fabricated facts, seriously affects users’ trust in online services. Given the large volumes of information generated by users, effectively spotting users involved in such deceptive engagement has become a challenging problem. Recent research has shown that techniques for analyzing temporal behavioral patterns are critical to addressing this problem. In this study, we survey recent advances in these techniques. We first summarize three approaches to model temporal information. Then, by using representative application examples, we discuss recent approaches with respect to their applications to real-world large-scale social media. With a focus on the temporal perspective, we then discuss advantages and challenges of each approach. For further resources related to this article, please visit the WIREs website.