
Showing papers in "Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery in 2013"


Journal ArticleDOI
TL;DR: Key milestones and the current state of affairs in the field of EDM are reviewed, together with specific applications, tools, and future insights.
Abstract: Applying data mining (DM) in education is an emerging interdisciplinary research field also known as educational data mining (EDM). It is concerned with developing methods for exploring the unique types of data that come from educational environments. Its goal is to better understand how students learn and identify the settings in which they learn, in order to improve educational outcomes and to gain insights into and explain educational phenomena. Educational information systems can store a huge amount of potential data from multiple sources, coming in different formats and at different granularity levels. Each particular educational problem has a specific objective with special characteristics that require a different treatment of the mining problem. These issues mean that traditional DM techniques cannot be applied directly to these types of data and problems. As a consequence, the knowledge discovery process has to be adapted and some specific DM techniques are needed. This paper introduces and reviews key milestones and the current state of affairs in the field of EDM, together with specific applications, tools, and future insights. © 2012 Wiley Periodicals, Inc.

885 citations


Journal ArticleDOI
TL;DR: An overview of existing techniques with special focus on the treatment of multidimensional signals (tensors) achieving simple and efficient CS algorithms is given.
Abstract: Compressed sensing (CS) comprises a set of relatively new techniques that exploit the underlying structure of data sets allowing their reconstruction from compressed versions or incomplete information. CS reconstruction algorithms are essentially nonlinear, demanding heavy computation overhead and large storage memory, especially in the case of multidimensional signals. Excellent review papers discussing CS state-of-the-art theory and algorithms already exist in the literature, which mostly consider data sets in vector forms. In this paper, we give an overview of existing techniques with special focus on the treatment of multidimensional signals (tensors). We discuss recent trends that exploit the natural multidimensional structure of signals (tensors) achieving simple and efficient CS algorithms. The Kronecker structure of dictionaries is emphasized and its equivalence to the Tucker tensor decomposition is exploited allowing us to use tensor tools and models for CS. Several examples based on real world multidimensional signals are presented, illustrating common problems in signal processing such as the recovery of signals from compressed measurements for magnetic resonance imaging (MRI) signals or for hyper-spectral imaging, and the tensor completion problem (multidimensional inpainting). WIREs Data Mining Knowl Discov 2013, 3:355–380. doi: 10.1002/widm.1108 Conflict of interest: The authors have declared no conflicts of interest for this article. For further resources related to this article, please visit the WIREs website.
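To make the Kronecker-structured idea in the abstract above more concrete, here is a minimal sketch (not the paper's algorithm) of compressed sensing for a small 2-D signal: the effective dictionary and sensing operator are Kronecker products of mode-wise matrices, and recovery uses orthogonal matching pursuit. All matrix sizes, the random dictionaries, and the sparsity level are illustrative assumptions.

```python
# Toy Kronecker-structured compressed sensing: a 2-D signal sparse in a
# separable dictionary is recovered from mode-wise compressed measurements.
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)

n1, n2 = 16, 16          # signal dimensions (rows x cols)
m1, m2 = 8, 8            # compressed measurements per mode
k = 5                    # sparsity level in the Kronecker dictionary

# Separable dictionary: random orthonormal bases for each mode; the effective
# dictionary is their Kronecker product.
D1 = np.linalg.qr(rng.standard_normal((n1, n1)))[0]
D2 = np.linalg.qr(rng.standard_normal((n2, n2)))[0]

# Sparse coefficient vector and the corresponding 2-D signal.
coeffs = np.zeros(n1 * n2)
support = rng.choice(n1 * n2, size=k, replace=False)
coeffs[support] = rng.standard_normal(k)
signal = (np.kron(D1, D2) @ coeffs).reshape(n1, n2)

# Mode-wise random sensing matrices; the overall operator is their Kronecker product.
A1 = rng.standard_normal((m1, n1)) / np.sqrt(m1)
A2 = rng.standard_normal((m2, n2)) / np.sqrt(m2)
measurements = A1 @ signal @ A2.T        # equals kron(A1, A2) @ vec(signal)

# Recovery: solve the vectorized problem with OMP over the Kronecker system.
Phi = np.kron(A1 @ D1, A2 @ D2)
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k, fit_intercept=False)
omp.fit(Phi, measurements.ravel())
recovered = (np.kron(D1, D2) @ omp.coef_).reshape(n1, n2)

print("relative error:", np.linalg.norm(recovered - signal) / np.linalg.norm(signal))
```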

85 citations


Journal ArticleDOI
TL;DR: Different classes of methods that alone or (in many cases) combined accelerate genetics‐based machine learning methods are reviewed.
Abstract: In the last decade, genetics-based machine learning methods have shown their competence in large-scale data mining tasks because of the scalability capacity that these techniques have demonstrated. This capacity goes beyond the innate massive parallelism of evolutionary computation methods by the proposal of a variety of mechanisms specifically tailored for machine learning tasks, including knowledge representations that exploit regularities in the datasets, hardware accelerations or data-intensive computing methods, among others. This paper reviews different classes of methods that alone or in many cases combined accelerate genetics-based machine learning methods. © 2013 Wiley Periodicals, Inc.

74 citations


Journal ArticleDOI
TL;DR: This paper will try to show that FCA actually provides support for processing large, dynamic, complex data augmented with additional knowledge.
Abstract: During the last three decades, formal concept analysis (FCA) became a well-known formalism in data analysis and knowledge discovery because of its usefulness in important domains of knowledge discovery in databases (KDD) such as ontology engineering, association rule mining, and machine learning, as well as its relation to other established theories for representing and processing knowledge, like description logics, conceptual graphs, and rough sets. In its early days, FCA was sometimes misconceived as a static, crisp, hardly scalable formalism for binary data tables. In this paper, we will try to show that FCA actually provides support for processing large, dynamic, complex (possibly uncertain) data augmented with additional knowledge. © 2013 Wiley Periodicals, Inc.
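As a small illustration of the core FCA construction behind the abstract above, the sketch below enumerates all formal concepts (closed extent/intent pairs) of a tiny, made-up binary context by brute force. Real FCA tools use much more efficient algorithms (e.g., NextClosure); the object and attribute names are invented.

```python
# Brute-force formal concept enumeration over a toy objects x attributes context.
from itertools import combinations

objects = ["doc1", "doc2", "doc3", "doc4"]
attributes = ["mining", "fuzzy", "rules", "graphs"]
# incidence[i][j] == 1 means object i has attribute j
incidence = [
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 0, 1],
]

def common_objects(attr_set):
    """Objects having every attribute in attr_set (the extent operator)."""
    return frozenset(
        i for i in range(len(objects))
        if all(incidence[i][j] for j in attr_set)
    )

def common_attributes(obj_set):
    """Attributes shared by every object in obj_set (the intent operator)."""
    return frozenset(
        j for j in range(len(attributes))
        if all(incidence[i][j] for i in obj_set)
    )

# A formal concept is a pair (extent, intent) closed under the two operators.
concepts = set()
for r in range(len(attributes) + 1):
    for attrs in combinations(range(len(attributes)), r):
        extent = common_objects(frozenset(attrs))
        intent = common_attributes(extent)      # closure of the attribute set
        concepts.add((extent, intent))

for extent, intent in sorted(concepts, key=lambda c: (len(c[0]), sorted(c[0]))):
    print({objects[i] for i in extent}, {attributes[j] for j in intent})
```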

70 citations


Journal ArticleDOI
TL;DR: The use of different EC paradigms for feature selection in classification problems are reviewed including representation, evaluation, and validation to uncover the best EC algorithms for FSS and to point at future research directions.
Abstract: Feature subset selection (FSS) has received a great deal of attention in statistics, machine learning, and data mining. Real world data analyzed by data mining algorithms can involve a large number of redundant or irrelevant features or simply too many features for a learning algorithm to handle them efficiently. Feature selection is becoming essential as databases grow in size and complexity. The selection process is expected to bring benefits in terms of better performing models, computational efficiency, and simpler more understandable models. Evolutionary computation (EC) encompasses a number of naturally inspired techniques such as genetic algorithms, genetic programming, ant colony optimization, or particle swarm optimization algorithms. Such techniques are well suited to feature selection because the representation of a feature subset is straightforward and the evaluation can also be easily accomplished through the use of wrapper or filter algorithms. Furthermore, the capability of such heuristic algorithms to efficiently search large search spaces is of great advantage to the feature selection problem. Here, we review the use of different EC paradigms for feature selection in classification problems. We discuss details of each implementation including representation, evaluation, and validation. The review enables us to uncover the best EC algorithms for FSS and to point at future research directions. WIREs Data Mining Knowl Discov 2013, 3:381–407. doi: 10.1002/widm.1106 Conflict of interest: The authors have declared no conflicts of interest for this article. For further resources related to this article, please visit the WIREs website.
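As a hedged sketch of the wrapper-style evolutionary feature selection surveyed above, the example below evolves binary feature masks with a simple genetic algorithm and evaluates each mask with a cross-validated k-NN wrapper. Population size, mutation rate, the dataset, and the classifier are arbitrary assumptions, not settings from the paper.

```python
# Minimal GA wrapper for feature subset selection (illustrative only).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)
X, y = load_breast_cancer(return_X_y=True)
n_features = X.shape[1]

def fitness(mask):
    """Wrapper evaluation: cross-validated accuracy of k-NN on the subset."""
    if not mask.any():
        return 0.0
    clf = KNeighborsClassifier(n_neighbors=5)
    return cross_val_score(clf, X[:, mask], y, cv=3).mean()

pop_size, generations, mutation_rate = 20, 10, 0.05
population = rng.random((pop_size, n_features)) < 0.5   # random binary masks

for gen in range(generations):
    scores = np.array([fitness(ind) for ind in population])
    order = np.argsort(scores)[::-1]
    parents = population[order[: pop_size // 2]]          # truncation selection
    children = []
    while len(children) < pop_size - len(parents):
        a, b = parents[rng.integers(len(parents), size=2)]
        cut = rng.integers(1, n_features)                  # one-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        flip = rng.random(n_features) < mutation_rate      # bit-flip mutation
        children.append(np.where(flip, ~child, child))
    population = np.vstack([parents, children])

best = max(population, key=fitness)
print("selected features:", int(best.sum()), "accuracy:", round(fitness(best), 3))
```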

57 citations


Journal ArticleDOI
TL;DR: This paper provides a comprehensive survey of state-of-the-art algorithms for association rule mining, especially when the data sets used for rule mining are not static.
Abstract: Association rule mining is a computationally expensive task. Despite the huge processing cost, it has gained tremendous popularity due to the usefulness of association rules. Several efficient algorithms can be found in the literature. This paper provides a comprehensive survey of the state-of-the-art algorithms for association rule mining, especially when the data sets used for rule mining are not static. The addition of new data to a data set may lead to additional rules or to the modification of existing rules. Finding the association rules from the whole data set may waste significant time if the process is started from scratch. Several algorithms have been developed to address this important aspect of the association rule mining problem. This paper analyzes some of the algorithms that tackle the incremental association rule mining problem. © 2013 Wiley Periodicals, Inc.
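A simplified sketch of the incremental idea behind algorithms such as FUP follows: keep support counts for itemsets mined from the original database and, when new transactions arrive, update those counts from the increment only instead of rescanning everything. A full algorithm would also handle itemsets that only become frequent after the update via a limited rescan; that step is omitted here. The transactions and threshold are made up for illustration.

```python
# Incremental support counting over an original database plus an increment.
from itertools import combinations
from collections import Counter

def count_itemsets(transactions, max_size=2):
    counts = Counter()
    for t in transactions:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(t), size):
                counts[itemset] += 1
    return counts

original = [{"bread", "milk"}, {"bread", "beer"}, {"milk", "beer"}, {"bread", "milk", "beer"}]
increment = [{"bread", "milk"}, {"milk", "beer"}]
min_support = 0.5

counts = count_itemsets(original)        # mined once from the old data
total = len(original)

# Incremental update: scan only the new transactions.
counts.update(count_itemsets(increment))
total += len(increment)

frequent = {i: c for i, c in counts.items() if c / total >= min_support}
for itemset, c in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(itemset, f"{c}/{total}")
```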

56 citations


Journal ArticleDOI
TL;DR: Knowledge hiding is an emerging area of research focused on appropriately modifying data so that sensitive knowledge escapes the mining process and, for privacy reasons, is not communicated to the public.
Abstract: Privacy-preserving data mining has recently been introduced to cope with privacy issues related to the data subjects in the course of mining the data. It has also been recognized that it is not only the data that need to be protected but also sensitive knowledge hidden in the data. Knowledge hiding is an emerging area of research focusing on appropriately modifying the data in such a way that sensitive knowledge escapes the mining and, for privacy purposes, is not communicated to the public. This article investigates the development of techniques falling under the knowledge-hiding umbrella that pertain to the association rule-mining task. These techniques are known as association rule hiding or frequent pattern hiding approaches, and have been receiving a lot of attention lately because they touch upon important issues in handling commonly used patterns such as frequent patterns and association rules. We present an overview of this area as well as a taxonomy and a presentation of an important sample of algorithms. © 2013 Wiley Periodicals, Inc.
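A toy sanitization sketch of one common rule-hiding heuristic: lower the support of a sensitive itemset below the mining threshold by deleting one of its items from just enough supporting transactions. The transaction data, the sensitive itemset, the victim item, and the threshold are invented; published algorithms choose victim transactions and items far more carefully to limit side effects on non-sensitive rules.

```python
# Hide a sensitive itemset by reducing its support below the mining threshold.
transactions = [
    {"bread", "milk", "beer"},
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"milk", "eggs"},
    {"bread", "beer"},
]
sensitive = {"bread", "milk"}      # itemset whose rules should be hidden
min_support = 0.4                  # threshold assumed to be used by the miner

def support(itemset, db):
    return sum(itemset <= t for t in db) / len(db)

print("before:", support(sensitive, transactions))

victim_item = "milk"               # item to delete from supporting transactions
for t in transactions:
    if support(sensitive, transactions) < min_support:
        break                      # already hidden, stop sanitizing
    if sensitive <= t:
        t.discard(victim_item)     # sanitize this transaction

print("after:", support(sensitive, transactions))
```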

52 citations


Journal ArticleDOI
TL;DR: This paper reviews these techniques according to the three Web‐mining categories above—fuzzy Web usage mining, fuzzy Web content mining, and fuzzy Web structure mining.
Abstract: The Internet has become an unlimited resource of knowledge, and is thus widely used in many applications. Web mining plays an important role in discovering such knowledge. This mining can be roughly divided into three categories, including Web usage mining, Web content mining, and Web structure mining. Data and knowledge on the Web may, however, consist of imprecise, incomplete, and uncertain data. Because fuzzy-set theory is often used to handle such data, several fuzzy Web-mining techniques have been proposed to reveal fuzzy and linguistic knowledge. This paper reviews these techniques according to the three Web-mining categories above—fuzzy Web usage mining, fuzzy Web content mining, and fuzzy Web structure mining. Some representative approaches in each category are introduced and compared. © 2013 Wiley Periodicals, Inc. This article is categorized under: Algorithmic Development > Web Mining Technologies > Computational Intelligence
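A small illustration of the fuzzy Web usage mining idea described above: numeric browsing durations are fuzzified into linguistic terms (short/medium/long) with triangular membership functions, and the fuzzy support of a linguistic pattern is aggregated across user sessions. The membership breakpoints and session data are assumptions for illustration only.

```python
# Fuzzify page-view durations and compute the fuzzy support of a pattern.
def triangular(x, a, b, c):
    """Triangular membership function with peak at b over [a, c]."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

terms = {  # durations in seconds
    "short":  lambda x: triangular(x, -1, 0, 60),
    "medium": lambda x: triangular(x, 30, 90, 180),
    "long":   lambda x: triangular(x, 120, 300, 10**6),
}

# Each session: seconds spent on two pages of interest.
sessions = [
    {"news": 45, "sports": 200},
    {"news": 80, "sports": 150},
    {"news": 20, "sports": 30},
]

# Fuzzy support of the pattern (news is medium) AND (sports is long),
# using min for the AND and averaging over sessions.
pattern = [("news", "medium"), ("sports", "long")]
memberships = [
    min(terms[term](s[page]) for page, term in pattern) for s in sessions
]
print("fuzzy support:", round(sum(memberships) / len(sessions), 3))
```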

47 citations


Journal ArticleDOI
TL;DR: The experimental results show that the proposed topic‐based approach outperforms the state‐of‐the‐art profile‐ and document‐based models, which use information retrieval methods to rank experts, in the search space given a field of expertise as an input query.
Abstract: The task of expert finding is to rank the experts in the search space given a field of expertise as an input query. In this paper, we propose a topic modeling approach for this task. The proposed model uses latent Dirichlet allocation (LDA) to induce probabilistic topics. In the first step of our algorithm, the main topics of a document collection are extracted using LDA. The extracted topics present the connection between expert candidates and user queries. In the second step, the topics are used as a bridge to find the probability of selecting each candidate for a given query. The candidates are then ranked based on these probabilities. The experimental results on the Text REtrieval Conference (TREC) Enterprise track for 2005 and 2006 show that the proposed topic-based approach outperforms the state-of-the-art profile- and document-based models, which use information retrieval methods to rank experts. Moreover, we demonstrate the superiority of the proposed topic-based approach over improved document-based expert finding systems, which consider additional information such as local context, candidate priors, and query expansion.
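A rough sketch of the two-step, topic-bridged ranking described above: (1) fit LDA over the document collection, (2) score each candidate for a query by summing, over topics, the probability of the query words given the topic times the topic weight in the candidate's documents. The toy corpus, author assignments, and hyperparameters are assumptions, not the paper's data or exact model.

```python
# Topic-bridged expert ranking with LDA (illustrative sketch).
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "fuzzy clustering of web usage logs",
    "association rule mining with apriori",
    "deep fuzzy sets for uncertain web data",
    "frequent pattern and rule mining at scale",
]
authors = ["alice", "bob", "alice", "bob"]     # candidate who wrote each document

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
vocab = vectorizer.get_feature_names_out()

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)                       # p(topic | document)
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

def score(candidate, query_words):
    """Sum over topics of p(query | topic) * p(topic | candidate)."""
    own = doc_topic[[i for i, a in enumerate(authors) if a == candidate]]
    p_topic = own.mean(axis=0)                         # candidate's topic profile
    word_idx = [np.where(vocab == w)[0][0] for w in query_words if w in vocab]
    p_query_given_topic = topic_word[:, word_idx].prod(axis=1)
    return float(p_query_given_topic @ p_topic)

query = ["fuzzy", "web"]
for cand in sorted(set(authors), key=lambda c: -score(c, query)):
    print(cand, round(score(cand, query), 6))
```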

45 citations


Journal ArticleDOI
TL;DR: An overview of the historical evolution and the key concepts of stream processing is provided, with special focus on adaptivity and Cloud-based elasticity.
Abstract: Stream processing is a computing paradigm that has emerged from the necessity of handling high volumes of data in real time. In contrast to traditional databases, stream-processing systems perform continuous queries and handle data on-the-fly. Today, a wide range of application areas relies on efficient pattern detection and queries over streams. The advent of Cloud computing fosters the development of elastic stream-processing platforms, which are able to dynamically adapt based on different cost–benefit trade-offs. This article provides an overview of the historical evolution and the key concepts of stream processing, with special focus on adaptivity and Cloud-based elasticity.

43 citations


Journal ArticleDOI
TL;DR: The main goal of this paper is to review the main methodologies to mine life sciences data and the ways they are coupled to high‐performance infrastructures and systems that result in an efficient analysis.
Abstract: Data mining (DM) is increasingly used in the analysis of data generated in life sciences, including biological data produced in several disciplines such as genomics and proteomics, medical data produced in clinical practice, and administrative data produced in health care. The difficulty in mining such data is twofold. First of all, data in life sciences are inherently heterogeneous, spanning from molecular level data to clinical and administrative data. Second, data in life sciences are produced at an increasing rate and data repositories are becoming very large. Thus, the management and analysis of such data is becoming a main bottleneck in biomedical research. The main goal of this paper is to review the main methodologies to mine life sciences data and the ways they are coupled to high-performance infrastructures and systems that result in an efficient analysis. This paper recalls basic concepts of DM, grids, and distributed DM on grids, and reviews main approaches to mine biomedical data on high-performance infrastructures with special focus on the analysis of genomics, proteomics, and interactomics data, and the exploration of magnetic resonance images in neurosciences. The paper can be of interest both to bioinformaticians, who can learn how to exploit high performance infrastructures to mine life sciences data, and to computer scientists, who can address the heterogeneity and the high volumes of life sciences data at the data management, algorithm, and user interface layers. © 2013 Wiley Periodicals, Inc.

Journal ArticleDOI
TL;DR: This paper presents an abstract high‐level picture of combined mining and the combined patterns from the perspective of object and pattern relation analysis and discusses several fundamental aspects of combined pattern mining.
Abstract: Combined mining is a technique for analyzing object relations and pattern relations, and for extracting and constructing actionable knowledge (patterns or exceptions). Although combined patterns can be built within a single method, such as combined sequential patterns by aggregating relevant frequent sequences, this knowledge is composed of multiple constituent components (the left hand side) from multiple data sources, which are represented by different feature spaces, or identified by diverse modeling methods. In some cases, this knowledge is also associated with certain impacts (influence, action, or conclusion, on the right hand side). This paper presents an abstract high-level picture of combined mining and the combined patterns from the perspective of object and pattern relation analysis. Several fundamental aspects of combined pattern mining are discussed, including feature interaction, pattern interaction, pattern dynamics, pattern impact, pattern relation, pattern structure, pattern paradigm, pattern formation criteria, and pattern presentation (in terms of pattern ontology and pattern dynamic charts). We also briefly illustrate the concepts and discuss how they can be applied to mining complex data for complex knowledge in either a multifeature, multisource, or multimethod scenario. © 2013 Wiley Periodicals, Inc.

Journal ArticleDOI
TL;DR: Some of the current challenges in the analysis of large‐scale social network data include social network modeling and representation, link mining, sentiment analysis, semantic SNA, information diffusion, viral marketing, and influential node mining.
Abstract: Social network analysis (SNA) is a multidisciplinary field dedicated to the analysis and modeling of relations and diffusion processes among various objects in nature, society, and other information/knowledge processing entities, with the aim of understanding how the behavior of individuals and their interactions translates into large-scale social phenomena. Because of the exploding popularity of online social networks and the availability of huge amounts of user-generated content, there is a great opportunity to analyze social networks and their dynamics at resolutions and levels not seen before. This has resulted in a significant increase in research literature at the intersection of the computing and social sciences, leading to several techniques for social network modeling and analysis in the areas of machine learning and data mining. Some of the current challenges in the analysis of large-scale social network data include social network modeling and representation, link mining, sentiment analysis, semantic SNA, information diffusion, viral marketing, and influential node mining. WIREs Data Mining Knowl Discov 2013, 3:408–444. doi: 10.1002/widm.1105 Conflict of interest: The authors have declared no conflicts of interest for this article. For further resources related to this article, please visit the WIREs website.

Journal ArticleDOI
TL;DR: This review surveys what is possible, and also outlines current research directions, for the automatic creation of abstract, structured representations of knowledge such as lexicons, taxonomies, and ontologies, calling on the proliferation of interlinked resources already available on the web for background knowledge and general information about the world.
Abstract: Abstract, structured representations of knowledge such as lexicons, taxonomies, and ontologies have proven to be powerful resources not only for the systematization of knowledge in general, but also to support practical technologies of document organization, information retrieval, natural language understanding, and question-answering systems. These resources are extremely time consuming for people to create and maintain, yet demand for them is growing, particularly in specialized areas ranging from legacy documents of large enterprises to rapidly changing domains such as current affairs and celebrity news. Consequently, researchers are investigating methods of creating such structures automatically from document collections, calling on the proliferation of interlinked resources already available on the web for background knowledge and general information about the world. This review surveys what is possible, and also outlines current research directions.

Journal ArticleDOI
TL;DR: The main intent of the survey is to explain the idea behind those approaches and consolidate the research contributions along with their significance and limitations.
Abstract: Advancements in computer and communication technologies demand new perceptions of distributed computing environments and the development of distributed data sources for storing voluminous amounts of data. In such circumstances, mining multiple data sources to extract useful patterns of significance is considered a challenging task within the data mining community. The domain of multi-database mining (MDM) is regarded as a promising research area, as evidenced by numerous research attempts in the recent past. The methods that exist for discovering knowledge from multiple data sources fall into two broad categories, namely (1) mono-database mining and (2) local pattern analysis. The main intent of the survey is to explain the idea behind these approaches and to consolidate the research contributions along with their significance and limitations.

Journal ArticleDOI
TL;DR: This article highlights research on data visualization and visual analytics techniques, as well as existing visual analytics systems and applications, including a perspective on the field from the chemical process industry.
Abstract: In the past decade, the analysis of data has faced the challenge of dealing with very large and complex datasets and the real-time generation of data. Technologies to store and access these complex and large datasets are in place. However, robust and scalable analysis technologies are needed to extract meaningful information from these datasets. The research field of Information Visualization and Visual Data Analytics addresses this need. Information visualization and data mining are often used complementarily; their common goal is the extraction of meaningful information from complex and possibly large data. However, whereas data mining focuses on the use of silicon hardware, visualization techniques also aim to harness the powerful image-processing capabilities of the human brain. This article highlights the research on data visualization and visual analytics techniques. Furthermore, we highlight existing visual analytics techniques, systems, and applications, including a perspective on the field from the chemical process industry.

Journal ArticleDOI
TL;DR: Recent advances in Bayesian modeling for trees are reviewed, from simple Bayesian CART models, treed Gaussian process, sequential inference via dynamic trees, to ensemble modeling via Bayesian additive regression trees (BART).
Abstract: Tree-based regression and classification, popularized in the 1980s with the advent of classification and regression trees (CART), has seen a recent resurgence in popularity alongside a boom in modern computing power. The new methodologies take advantage of simulation-based inference, and ensemble methods, to produce higher fidelity response surfaces with competitive out-of-sample predictive performance while retaining many of the attractive features of classic trees: thrifty divide-and-conquer nonparametric inference, variable selection and sensitivity analysis, and nonstationary modeling features. In this paper, we review recent advances in Bayesian modeling for trees, from simple Bayesian CART models, treed Gaussian processes, and sequential inference via dynamic trees, to ensemble modeling via Bayesian additive regression trees (BART). We outline open source R packages supporting these methods and illustrate their use.

Journal ArticleDOI
TL;DR: The process of evolutionary design of DTs is reviewed, providing the description of the most common approaches as well as referring to recognized specializations.
Abstract: The decision tree (DT) is one of the most popular symbolic machine learning approaches to classification, with a wide range of applications. Decision trees are especially attractive in data mining: they have an intuitive representation and are, therefore, easy to understand and interpret, even by nontechnical experts. The most important and critical aspect of DTs is the process of their construction. Several induction algorithms exist that use the recursive top-down principle to divide training objects into subgroups based on different statistical measures in order to achieve homogeneous subgroups. Although they are robust and fast and generally provide good results, their deterministic and heuristic nature can lead to suboptimal solutions. Therefore, alternative approaches have been developed that try to overcome the drawbacks of classical induction. One of the most viable approaches seems to be the use of evolutionary algorithms, which can produce better DTs as they search for globally optimal solutions, evaluating potential solutions with regard to different criteria. We review the process of evolutionary design of DTs, providing a description of the most common approaches as well as referring to recognized specializations. The overall process is first explained and later demonstrated in a step-by-step case study using a dataset from the University of California, Irvine (UCI) machine learning repository. © 2012 Wiley Periodicals, Inc.

Journal ArticleDOI
TL;DR: The relevance and possible applications of evolutionary algorithms, particularly genetic algorithms, in the domain of knowledge discovery in databases are discussed, and some of the genetic-based classification rule discovery methods based on a fidelity criterion are presented.
Abstract: This paper discusses the relevance and possible applications of evolutionary algorithms, particularly genetic algorithms, in the domain of knowledge discovery in databases. Knowledge discovery in databases is a process of discovering knowledge along with its validity, novelty, and potentiality. Various genetic-based feature selection algorithms with their pros and cons are discussed in this article. Rule (a kind of high-level representation of knowledge) discovery from databases, posed as a single- or multiobjective problem, is a difficult optimization problem. Here, we present a review of some of the genetic-based classification rule discovery methods based on a fidelity criterion. The intractable nature of fuzzy rule mining using single- and multiobjective genetic algorithms reported in the literature is reviewed. An extensive list of relevant and useful references is given for further research. © 2013 Wiley Periodicals, Inc. This article is categorized under: Fundamental Concepts of Data and Knowledge > Key Design Issues in Data Mining Technologies > Computational Intelligence

Journal ArticleDOI
TL;DR: The paper proposes Market Basket Analysis algorithms with MapReduce that use the Apriori property, adapting an existing Apriori algorithm and building a simple algorithm that sorts data sets and converts them to (key, value) pairs to fit with MapReduce.
Abstract: The MapReduce approach has been popular for computing over large-scale data since Google implemented its platform on the Google File System (GFS), followed by Amazon Web Services (AWS) providing the Apache Hadoop platform on inexpensive computing nodes. MapReduce motivates the redesign and conversion of existing sequential algorithms to MapReduce, a restricted parallel programming model; accordingly, this paper proposes Market Basket Analysis algorithms with MapReduce that use the Apriori property. Two algorithms are proposed: one adapts an existing Apriori algorithm, and the other is a simple algorithm that sorts data sets and converts them to (key, value) pairs to fit with MapReduce. They are executed on the Amazon EC2 MapReduce platform. The experimental results show that the Apriori algorithm does not perform as well as the simple algorithm. Using the simple algorithm, the MapReduce code increases performance as more nodes are added, but at a certain point there is a bottleneck that does not allow further performance gain. It is believed that the operations of distributing, aggregating, and reducing data in MapReduce cause the bottleneck. WIREs Data Mining Knowl Discov 2013, 3:445–452. doi: 10.1002/widm.1107 For further resources related to this article, please visit the WIREs website.
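An in-memory imitation of the "simple algorithm" style described above: each transaction is mapped to (key, value) pairs keyed by sorted item pairs, the pairs are shuffled by key, and a reducer sums the counts. This only mimics the MapReduce dataflow in a single process with made-up data; a real job would run on Hadoop/EC2 as in the paper.

```python
# Single-process imitation of map -> shuffle -> reduce for item-pair counting.
from itertools import combinations
from collections import defaultdict

transactions = [
    ["bread", "milk", "beer"],
    ["bread", "milk"],
    ["milk", "beer"],
    ["bread", "milk", "beer"],
]

def mapper(transaction):
    """Emit ((item_a, item_b), 1) for every sorted item pair in the basket."""
    for pair in combinations(sorted(set(transaction)), 2):
        yield pair, 1

def shuffle(mapped):
    """Group the mapped (key, value) pairs by key."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Sum the partial counts for one pair key."""
    return key, sum(values)

mapped = [kv for t in transactions for kv in mapper(t)]
grouped = shuffle(mapped)
counts = dict(reducer(k, v) for k, v in grouped.items())

for pair, count in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(pair, count)
```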

Journal ArticleDOI
TL;DR: The objective of this research is to identify high-value markets by using data mining technologies and a new model based on the 'Recency–Frequency–Monetary' (RFM) model to process customer value markets for the leisure coffee-shop industry.
Abstract: The objective of this research is to identify high-value markets by using data mining technologies and a new model. The well-known Fuzzy C-Means algorithm is applied to perform the market segmentation of the customer benefit market, and a new model [based on the 'Recency–Frequency–Monetary' (RFM) model] is applied to process customer value markets for the leisure coffee-shop industry. The results show the relationships between the two types of markets (benefit and customer value), which are presented as fuzzy and nonfuzzy association rules. These rules can be applied to customer relationship management systems for obtaining useful and high-value markets. The results can help the leisure coffee-shop industry to acquire knowledge of its customers and to identify explicit customer values for marketing plans. © 2013 Wiley Periodicals, Inc.
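A compact, hand-rolled Fuzzy C-Means sketch applied to made-up recency/frequency/monetary (RFM) records, illustrating the kind of segmentation the abstract describes. The cluster count, fuzzifier m, and the data are assumptions; the paper's actual pipeline and parameters will differ.

```python
# Minimal Fuzzy C-Means on toy RFM customer records.
import numpy as np

rng = np.random.default_rng(1)

# Toy RFM records: (days since last visit, visits per month, average spend).
rfm = np.array([
    [5, 12, 8.0], [7, 10, 7.5], [40, 2, 3.0],
    [45, 1, 2.5], [10, 8, 6.0], [60, 1, 2.0],
], dtype=float)
X = (rfm - rfm.mean(axis=0)) / rfm.std(axis=0)      # standardize the features

c, m, n_iter = 2, 2.0, 100                          # clusters, fuzzifier, iterations
U = rng.random((len(X), c))
U /= U.sum(axis=1, keepdims=True)                   # random initial memberships

for _ in range(n_iter):
    W = U ** m
    centers = (W.T @ X) / W.sum(axis=0)[:, None]    # weighted cluster centers
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
    U = 1.0 / (dist ** (2 / (m - 1)))
    U /= U.sum(axis=1, keepdims=True)               # updated fuzzy memberships

print("cluster centers (standardized RFM):\n", np.round(centers, 2))
print("memberships:\n", np.round(U, 2))
```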

Journal ArticleDOI
TL;DR: A general framework for the discovery of a taxonomy of process models at different abstraction levels is presented, and a survey of different kinds of basic techniques that can be exploited to this purpose is offered.
Abstract: Modeling behavioral aspects of business processes is a hard and costly task, which usually requires heavy intervention of business experts. This explains the increasing attention given to process mining techniques, which automatically extract behavioral process models from log data. In the case of complex processes, however, the models identified by classical process mining techniques are hardly useful to analyze business operations at a suitable abstraction level. In fact, the need of process abstraction emerged in several application scenarios, and abstraction methods are already supported in some business-management platforms, which allow users to manually define abstract views for the process at hand. Therefore, it comes with no surprise that process mining research recently considered the issue of mining processes at different abstraction levels, mainly in the form of a taxonomy of process models, as to overcome the drawbacks of traditional approaches. This paper presents a general framework for the discovery of such a taxonomy, and offers a survey of different kinds of basic techniques that can be exploited to this purpose: (1) workflow modeling and discovery techniques, (2) clustering techniques enabling the discovery of different behavioral process classes, and (3) activity abstraction techniques for associating a generalized process model with each higher level taxonomy node. © 2013 Wiley Periodicals, Inc.

Journal ArticleDOI
TL;DR: This review discusses different aspects of host–pathogen interactions (HPIs) and the available data resources and tools used to study them, and discusses in detail models of HPIs at various levels of abstraction, along with their applications and limitations.
Abstract: The rapid emergence of infectious diseases calls for immediate attention to determine practical solutions for intervention strategies. To this end, it becomes necessary to obtain a holistic view of the complex host–pathogen interactome. Advances in omics and related technology have resulted in massive generation of data for the interacting systems at unprecedented levels of detail. Systems-level studies with the aid of mathematical tools contribute to a deeper understanding of biological systems, where intuitive reasoning alone does not suffice. In this review, we discuss different aspects of host–pathogen interactions (HPIs) and the available data resources and tools used to study them. We discuss in detail models of HPIs at various levels of abstraction, along with their applications and limitations. We also enlist a few case studies, which incorporate different modeling approaches, providing significant insights into disease. © 2013 Wiley Periodicals, Inc.

Journal ArticleDOI
TL;DR: This review is devoted to structure‐based approaches to the comparative analysis of proteins and the inference of protein function that rely on graph formalisms for modeling protein structures and employ graph‐theoretic algorithms for analyzing and comparing such structures.
Abstract: While sequence-based methods are widely used as reliable tools for protein function prediction in general, these methods are likely to fail in cases of low sequence similarity. This is due to the fact that proteins with low sequence similarity may nevertheless have similar functions and exhibit similar structures. In such cases, structure-based comparison methods can help to provide further insights owing to the widely accepted paradigm that structure mirrors function. Moreover, thanks to the steady increase in structural information with the advent of structural genomic projects and the steady improvements in structure prediction, these methods are becoming more and more applicable. Many structure-based approaches to the comparative analysis of proteins and the inference of protein function rely on graph formalisms for modeling protein structures and, correspondingly, employ graph-theoretic algorithms for analyzing and comparing such structures. This review is devoted to approaches of that kind and presents an overview of the most important graph-based algorithms.

Journal ArticleDOI
TL;DR: The development of key techniques that solve the three main subproblems of PPRL, namely privacy, linkage quality, and scaling P PRL to large databases are discussed and open challenges in this research area are highlighted.
Abstract: It has been recognized that sharing data between organizations can be of great benefit, since it can help discover novel and valuable information that is not available in individual databases. However, as organizations are under pressure to better utilize their large databases through sharing, integration, and analysis, protecting the privacy of personal information in such databases is an increasingly difficult task. Record linkage is the task of identifying and matching records that correspond to the same real-world entity in several databases. This task is a crucial infrastructure component in many modern information systems. Privacy and confidentiality concerns, however, commonly prevent the matching of databases that contain personal information across different organizations. In the past decade, efforts in the research area of privacy-preserving record linkage (PPRL) have aimed to develop techniques that facilitate the matching of records across databases such that, besides the matched records, no private or confidential information is revealed to any organization involved in such a linkage, or to any external party. We discuss the development of key techniques that solve the three main subproblems of PPRL, namely privacy, linkage quality, and scaling PPRL to large databases. We then highlight open challenges in this research area.
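A minimal sketch of one widely studied PPRL building block: encode name q-grams into Bloom filters at each organization and compare only the bit vectors with the Dice coefficient, so the raw names never need to be shared. The filter length, number of hash functions, and example names are illustrative choices, not the parameters discussed in the paper.

```python
# Bloom-filter encoding of name bigrams plus Dice-coefficient comparison.
import hashlib

def bigrams(text):
    text = text.lower().strip()
    return {text[i:i + 2] for i in range(len(text) - 1)}

def bloom_encode(text, size=64, hashes=4):
    """Set `hashes` bit positions per q-gram using salted SHA-1 digests."""
    bits = [0] * size
    for gram in bigrams(text):
        for salt in range(hashes):
            digest = hashlib.sha1(f"{salt}:{gram}".encode()).hexdigest()
            bits[int(digest, 16) % size] = 1
    return bits

def dice(a, b):
    """Dice similarity of two bit vectors."""
    common = sum(x & y for x, y in zip(a, b))
    return 2 * common / (sum(a) + sum(b))

# Each party encodes its record locally; only the Bloom filters are compared.
record_a = bloom_encode("christine smith")
record_b = bloom_encode("christina smith")
record_c = bloom_encode("john miller")

print("smith vs smith :", round(dice(record_a, record_b), 3))
print("smith vs miller:", round(dice(record_a, record_c), 3))
```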

Journal ArticleDOI
TL;DR: Several analytical techniques are reviewed that aim to estimate suppressed employment data in large regional or local economic databases, such as County Business Patterns, using goal-programming optimization models, and to transfer individual-level transportation data gathered in national surveys of transportation behavior to construct reliable estimates for local area units (Census tracts), using clustering and regression techniques.
Abstract: Several analytical techniques are reviewed that aim to (1) estimate suppressed employment data in large regional or local economic databases, such as County Business Patterns, using goal-programming optimization models, (2) estimate local population data, using regional Census data, remotely sensed and traditional data, and statistical modeling, and (3) transfer individual-level transportation data gathered in national surveys of transportation behavior to construct reliable estimates for local area units (Census tracts), using clustering and regression techniques. These methodologies are illustrative of the rapidly expanding opportunities for improving socioeconomic databases, using new data sources and new and older techniques in innovative ways, thus contributing to knowledge discovery.

Journal ArticleDOI
TL;DR: Cluster analysis and seriation are basic data mining techniques; clustering structures that enable us to achieve both a clustering and a seriation are considered, namely the hierarchical, pyramidal, and prepyramidal clustering structures.
Abstract: Cluster analysis and seriation are basic data mining techniques. We consider here clustering structures that enable us to achieve both a clustering and a seriation, namely the hierarchical, pyramidal, and prepyramidal clustering structures. Cluster collections of these types determine seriations of any data set by providing compatible orders, i.e., total rankings of the whole data set for which the objects within each cluster are consecutive. Moreover, the dissimilarity measures induced from such cluster collections are Robinsonian; in other words, the more distant the objects in a compatible order, the higher the induced dissimilarity value. This results in a one-to-one correspondence between weakly indexed pyramids and the class of Robinsonian dissimilarities, which still holds in a general setting where indistinguishable objects can be detected and, as shown in this paper, extends to prepyramids, which are not required to contain arbitrary intersections of clusters.
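A small utility sketch of the Robinsonian property mentioned above: given an ordering of the objects, a dissimilarity matrix is Robinsonian with respect to that order if values never decrease when moving away from the diagonal along any row or column. The example matrix and orders are invented for illustration.

```python
# Check whether a dissimilarity matrix is Robinsonian under a given order.
import numpy as np

def is_robinsonian(D, order):
    """Check the Robinson condition for dissimilarity matrix D under `order`."""
    P = np.asarray(D)[np.ix_(order, order)]     # permute rows/cols to the order
    n = len(order)
    for i in range(n):
        for j in range(i + 1, n):
            # Within row i, values must not decrease as j moves right;
            # within column j, values must not increase moving down toward the diagonal.
            if j + 1 < n and P[i, j] > P[i, j + 1]:
                return False
            if i + 1 <= j - 1 and P[i, j] < P[i + 1, j]:
                return False
    return True

D = [
    [0, 1, 3, 5],
    [1, 0, 2, 4],
    [3, 2, 0, 2],
    [5, 4, 2, 0],
]
print(is_robinsonian(D, order=[0, 1, 2, 3]))   # True: values grow away from diagonal
print(is_robinsonian(D, order=[0, 2, 1, 3]))   # False after shuffling the order
```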

Journal ArticleDOI
TL;DR: This review provides a brief description of the relevant biological interactions, outlines currently available technologies for measuring genomic characterizations, describes commonly used approaches for genetic regulatory network modeling, and discusses pertinent issues in modeling and inference of genetic interactions.
Abstract: The regulatory interactions in a cell form a complicated system, and an important goal of systems biology is to model and infer these interactions. The modeling and inference of genetic regulatory models requires understanding of the true biological interactions while incorporating the technological limitations on observation of the biological entities in the estimation of a robust model. This review is structured as follows: (a) a brief description of the biological interactions is provided, (b) currently available technologies for measuring genomic characterizations are outlined, (c) commonly used approaches for genetic regulatory network modeling are described, and (d) finally, some of the pertinent issues in modeling and inference of genetic interactions are discussed. WIREs Data Mining Knowl Discov 2013, 3:453–466. doi: 10.1002/widm.1103 The author has declared no conflicts of interest in relation to this article. For further resources related to this article, please visit the WIREs website.
