
Showing papers in "Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery in 2012"


Journal ArticleDOI
Xinwei Deng1
TL;DR: Experimental design is reviewed here for broad classes of data collection and analysis problems, including: fractioning techniques based on orthogonal arrays, Latin hypercube designs and their variants for computer experimentation, efficient design for data mining and machine learning applications, and sequential design for active learning.
Abstract: Maximizing data information requires careful selection, termed design, of the points at which data are observed. Experimental design is reviewed here for broad classes of data collection and analysis problems, including: fractioning techniques based on orthogonal arrays, Latin hypercube designs and their variants for computer experimentation, efficient design for data mining and machine learning applications, and sequential design for active learning. © 2012 Wiley Periodicals, Inc.
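As a concrete illustration of one of the design classes mentioned above, here is a minimal Latin hypercube sampler (a sketch of the general construction, not code from the paper; the function name and seeding are our own):

```python
import numpy as np

def latin_hypercube(n, d, seed=0):
    """n-point Latin hypercube design in [0, 1]^d: each of the n
    equal-width strata along every dimension holds exactly one point."""
    rng = np.random.default_rng(seed)
    # One random point inside each of the n strata, per dimension
    samples = (rng.random((n, d)) + np.arange(n)[:, None]) / n
    # Independently shuffle the strata ordering in each dimension
    for j in range(d):
        rng.shuffle(samples[:, j])
    return samples
```

Compared with plain random sampling, this guarantees one-dimensional stratification, which is why such designs are popular for computer experiments.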

1,025 citations


Journal ArticleDOI
TL;DR: A recently developed very efficient (linear time) hierarchical clustering algorithm is described, which can also be viewed as a hierarchical grid‐based algorithm.
Abstract: We survey agglomerative hierarchical clustering algorithms and discuss efficient implementations that are available in R and other software environments. We look at hierarchical self-organizing maps and mixture models. We review grid-based clustering, focusing on hierarchical density-based approaches. Finally, we describe a recently developed very efficient (linear time) hierarchical clustering algorithm, which can also be viewed as a hierarchical grid-based algorithm. This review adds to the earlier version, Murtagh F, Contreras P. Algorithms for hierarchical clustering: an overview, Wiley Interdiscip Rev: Data Mining Knowl Discov 2012, 2, 86–97. WIREs Data Mining Knowl Discov 2017, 7:e1219. doi: 10.1002/widm.1219 This article is categorized under: Algorithmic Development > Hierarchies and Trees; Technologies > Classification; Technologies > Structure Discovery and Clustering.
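To make the agglomerative idea concrete, here is a naive O(n³) single-linkage sketch for 1-D data (illustrative only; the linear-time algorithm the abstract refers to is far more sophisticated):

```python
def single_linkage(points, k):
    """Agglomerative single-linkage clustering of 1-D points,
    repeatedly merging the two closest clusters until k remain."""
    clusters = [[p] for p in points]
    def dist(a, b):
        # single-link distance: closest pair of members
        return min(abs(x - y) for x in a for y in b)
    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)   # merge the closest pair
    return clusters
```

Efficient implementations replace the quadratic pairwise search with nearest-neighbor chains or, as in the reviewed linear-time method, grid structures.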

977 citations


Journal ArticleDOI
TL;DR: In this paper, the authors focus on the importance of maintaining a proper alignment between event logs and process models and elaborate on the realization of such alignments and their application to conformance checking and performance analysis.
Abstract: Process mining techniques use event data to discover process models, to check the conformance of predefined process models, and to extend such models with information about bottlenecks, decisions, and resource usage. These techniques are driven by observed events rather than hand-made models. Event logs are used to learn and enrich process models. By replaying history using the model, it is possible to establish a precise relationship between events and model elements. This relationship can be used to check conformance and to analyze performance. For example, it is possible to diagnose deviations from the modeled behavior. The severity of each deviation can be quantified. Moreover, the relationship established during replay and the timestamps in the event log can be combined to show bottlenecks. These examples illustrate the importance of maintaining a proper alignment between event log and process model. Therefore, we elaborate on the realization of such alignments and their application to conformance checking and performance analysis. © 2012 Wiley Periodicals, Inc.
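The notion of aligning a logged trace with a model run can be sketched as a minimal-cost edit problem: matching events are free "synchronous moves", while mismatches are penalized as moves on the log only or on the model only. This toy version (our simplification, not the paper's algorithm) treats the model as a single fixed activity sequence:

```python
def alignment_cost(trace, model_run):
    """Minimal number of asynchronous moves needed to align a
    logged trace with one run of the model (unit costs)."""
    m, n = len(trace), len(model_run)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 or j == 0:
                D[i][j] = i + j                 # only asynchronous moves left
            elif trace[i - 1] == model_run[j - 1]:
                D[i][j] = D[i - 1][j - 1]       # synchronous move, free
            else:
                D[i][j] = 1 + min(D[i - 1][j],  # move on log only
                                  D[i][j - 1])  # move on model only
    return D[m][n]
```

A real conformance checker searches over all runs the model allows, not a single sequence, but the cost structure is the same.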

632 citations


Journal ArticleDOI
TL;DR: This paper synthesizes 10 years of RF development with emphasis on applications to bioinformatics and computational biology and some representative examples of RF applications in this context and possible directions for future research.
Abstract: The random forest (RF) algorithm by Leo Breiman has become a standard data analysis tool in bioinformatics. It has shown excellent performance in settings where the number of variables is much larger than the number of observations, can cope with complex interaction structures as well as highly correlated variables and return measures of variable importance. This paper synthesizes 10 years of RF development with emphasis on applications to bioinformatics and computational biology. Special attention is paid to practical aspects such as the selection of parameters, available RF implementations, and important pitfalls and biases of RF and its variable importance measures (VIMs). The paper surveys recent developments of the methodology relevant to bioinformatics as well as some representative examples of RF applications in this context and possible directions for future research. © 2012 Wiley Periodicals, Inc.
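One family of variable importance measures (VIMs) discussed in this literature is permutation importance: shuffle one predictor column and measure the drop in accuracy. A model-agnostic sketch (our own simplified version, working with any `predict` function rather than an actual forest):

```python
import random

def permutation_importance(predict, X, y, feature, trials=30, seed=0):
    """Average accuracy drop when one feature column is permuted,
    a simple stand-in for RF permutation-based VIMs."""
    rng = random.Random(seed)
    accuracy = lambda rows: sum(predict(r) == t for r, t in zip(rows, y)) / len(y)
    base = accuracy(X)
    drop = 0.0
    for _ in range(trials):
        col = [row[feature] for row in X]
        rng.shuffle(col)                      # break the feature-label link
        Xp = [row[:feature] + [v] + row[feature + 1:] for row, v in zip(X, col)]
        drop += base - accuracy(Xp)
    return drop / trials
```

A feature the model never uses yields an importance of exactly zero, which is one of the sanity checks the pitfalls literature recommends.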

613 citations


Journal ArticleDOI
TL;DR: This paper provides an overview of the foundations of frequent item set mining, starting from a definition of the basic notions and the core task, and discusses how the search space is structured to avoid redundant search, how the output is reduced by confining it to closed or maximal item sets or generators.
Abstract: Frequent item set mining is one of the best known and most popular data mining methods. Originally developed for market basket analysis, it is used nowadays for almost any task that requires discovering regularities between (nominal) variables. This paper provides an overview of the foundations of frequent item set mining, starting from a definition of the basic notions and the core task. It continues by discussing how the search space is structured to avoid redundant search, how it is pruned with the a priori property, and how the output is reduced by confining it to closed or maximal item sets or generators. In addition, it reviews some of the most important algorithmic techniques and data structures that were developed to make the search for frequent item sets as efficient as possible. © 2012 Wiley Periodicals, Inc.
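The a priori property mentioned above (every subset of a frequent item set is itself frequent) drives the level-wise search. A compact, deliberately inefficient sketch:

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise frequent item set mining; returns {itemset: support}."""
    transactions = [frozenset(t) for t in transactions]
    support = lambda s: sum(s <= t for t in transactions)
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= minsup]
    k = 1
    while level:
        frequent.update((s, support(s)) for s in level)
        k += 1
        # join step: unions of frequent sets from the previous level
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # prune step: the a priori property, then a support scan
        level = [c for c in candidates
                 if all(frozenset(s) in frequent for s in combinations(c, k - 1))
                 and support(c) >= minsup]
    return frequent
```

Real implementations avoid rescanning the data by using structures such as FP-trees; this sketch only shows the search-space logic.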

265 citations


Journal ArticleDOI
Ira Assent1
TL;DR: An overview of the effects of high-dimensional spaces and their implications for different clustering paradigms is provided, with pointers to the literature and a sketch of open research issues.
Abstract: High-dimensional data, i.e., data described by a large number of attributes, pose specific challenges to clustering. The so-called ‘curse of dimensionality’, coined originally to describe the general increase in complexity of various computational problems as dimensionality increases, is known to render traditional clustering algorithms ineffective. The curse of dimensionality, among other effects, means that with increasing number of dimensions, a loss of meaningful differentiation between similar and dissimilar objects is observed. As high-dimensional objects appear almost alike, new approaches for clustering are required. Consequently, recent research has focused on developing techniques and clustering algorithms specifically for high-dimensional data. Still, open research issues remain. Clustering is a data mining task devoted to the automatic grouping of data based on mutual similarity. Each cluster groups objects that are similar to one another, whereas dissimilar objects are assigned to different clusters, possibly separating out noise. In this manner, clusters describe the data structure in an unsupervised manner, i.e., without the need for class labels. A number of clustering paradigms exist that provide different cluster models and different algorithmic approaches for cluster detection. Common to all approaches is the fact that they require some underlying assessment of similarity between data objects. In this article, we provide an overview of the effects of high-dimensional spaces, and their implications for different clustering paradigms. We review models and algorithms that address clustering in high dimensions, with pointers to the literature, and sketch open research issues. We conclude with a summary of the state of the art. © 2012 Wiley Periodicals, Inc.
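The loss of differentiation the abstract describes can be demonstrated numerically: the relative contrast between the nearest and farthest neighbor of a query point shrinks as dimensionality grows. A small illustration of our own:

```python
import numpy as np

def distance_contrast(d, n=500, seed=0):
    """Relative contrast (d_max - d_min) / d_min between a random query
    and n uniform points in [0, 1]^d; it collapses as d increases."""
    rng = np.random.default_rng(seed)
    query = rng.random(d)
    points = rng.random((n, d))
    dist = np.linalg.norm(points - query, axis=1)
    return (dist.max() - dist.min()) / dist.min()
```

In two dimensions the contrast is large (some points are far closer than others), while in hundreds of dimensions all points sit at nearly the same distance, which is exactly why distance-based cluster models degrade.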

141 citations


Journal ArticleDOI
TL;DR: This paper attempts to provide a general and succinct overview of the essentials of social network analysis for those interested in taking a first look at this area and oriented to use data mining in social networks.
Abstract: Data mining is being increasingly applied to social networks. Two relevant reasons are the growing availability of large volumes of relational data, boosted by the proliferation of social media web sites, and the intuition that an individual's connections can yield richer information than his or her isolated attributes. This synergistic combination can prove germane to a variety of applications such as churn prediction, fraud detection, and marketing campaigns. This paper attempts to provide a general and succinct overview of the essentials of social network analysis for those interested in taking a first look at this area and oriented to use data mining in social networks. © 2012 Wiley Periodicals, Inc.

137 citations


Journal ArticleDOI
TL;DR: This paper reviews key milestones and the state of the art in the data stream mining area, and presents future insights.
Abstract: Mining data streams has been a focal point of research interest over the past decade. Hardware and software advances have contributed to the significance of this area of research by introducing faster than ever data generation. This rapidly generated data has been termed data streams. Credit card transactions, Google searches, phone calls in a city, and many others are typical data streams. In many important applications, it is essential to analyze this streaming data in real time. Traditional data mining techniques have fallen short in addressing the needs of data stream mining. Randomization, approximation, and adaptation have been used extensively in developing new techniques or adapting existing ones to enable them to operate in a streaming environment. This paper reviews key milestones and the state of the art in the data stream mining area. Future insights are also presented.
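Randomization is a good example of the stream-mining toolkit mentioned above: reservoir sampling (Vitter's Algorithm R) keeps a uniform random sample of a stream of unknown length in constant memory, one pass, one item at a time:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Uniform random sample of k items from a one-pass stream,
    using O(k) memory regardless of stream length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)          # fill the reservoir first
        else:
            j = rng.randrange(i + 1)        # keep item with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir
```

The same pattern of trading exactness for bounded memory underlies many of the approximation techniques the survey covers.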

98 citations


Journal ArticleDOI
TL;DR: Recent visualization techniques aimed at supporting tasks that require the analysis of text documents are reviewed, from approaches targeted at visually summarizing the relevant content of a single document to those aimed at assisting exploratory investigation of whole collections of documents.
Abstract: We review recent visualization techniques aimed at supporting tasks that require the analysis of text documents, from approaches targeted at visually summarizing the relevant content of a single document to those aimed at assisting exploratory investigation of whole collections of documents. Techniques are organized considering their target input material—either single texts or collections of texts—and their focus, which may be at displaying content, emphasizing relevant relationships, highlighting the temporal evolution of a document or collection, or helping users to handle results from a query posed to a search engine. We describe the approaches adopted by distinct techniques and briefly review the strategies they employ to obtain meaningful text models, discuss how they extract the information required to produce representative visualizations, the tasks they intend to support and the interaction issues involved, and strengths and limitations. Finally, we show a summary of techniques, highlighting their goals and distinguishing characteristics. We also briefly discuss some open problems and research directions in the fields of visual text mining and text analytics. © 2012 Wiley Periodicals, Inc.

61 citations


Journal ArticleDOI
TL;DR: Thorough and innovative retrospection and thinking are timely in bridging the gaps and promoting data mining toward next‐generation research and development: namely, the paradigm shift from knowledge discovery from data to actionable knowledge discovery and delivery.
Abstract: Actionable knowledge has been qualitatively and intensively studied in the social sciences. Its marriage with data mining is only a recent story. On the one hand, data mining has been booming for a while and has attracted an increasing variety of applications. On the other, it is a reality that the so-called knowledge discovered from data by following the classic frameworks often cannot support meaningful decision-making actions. This shows the poor relationship and significant gap between data mining research and practice, and between knowledge, power, and action, and forms an increasing imbalance between research outcomes and business needs. Thorough and innovative retrospection and thinking are timely in bridging the gaps and promoting data mining toward next-generation research and development: namely, the paradigm shift from knowledge discovery from data to actionable knowledge discovery and delivery. © 2012 Wiley Periodicals, Inc.

53 citations


Journal ArticleDOI
TL;DR: A brief overview of clustering is given, well-known partitional clustering methods are summarized, the major challenges and key issues of these methods are discussed, and simple numerical experiments using toy data sets are carried out to enhance the description of various clustering methods.
Abstract: Partitional clustering is an important part of cluster analysis. Cluster analysis can be considered one of the most important approaches to unsupervised learning. The goal of clustering is to find clusters from unlabeled data, which means that data belonging to the same cluster are as similar as possible, whereas data belonging to different clusters are as dissimilar as possible. Partitional clustering is categorized as a prototype-based model, i.e., each cluster can be represented by a prototype, leading to a concise description of the original data set. According to different definitions of prototypes, such as data point, hyperplane, and hypersphere, the clustering methods can be categorized into different types of clustering algorithms with various prototypes. Besides organizing these partitional clustering methods into such a unified framework, relations between some commonly used nonpartitional clustering methods and partitional clustering methods are also discussed here. We give a brief overview of clustering, summarize well-known partitional clustering methods, and discuss the major challenges and key issues of these methods. Simple numerical experiments using toy data sets are carried out to enhance the description of various clustering methods. © 2012 Wiley Periodicals, Inc.
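The prototype-based view is easiest to see in k-means, where each cluster's prototype is its centroid. A minimal 1-D sketch of Lloyd's algorithm (illustrative only):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm on 1-D data: alternate assigning points to
    the nearest prototype and moving each prototype to its cluster mean."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # assignment step: each point joins its nearest prototype
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: (p - centers[c]) ** 2)
            clusters[i].append(p)
        # update step: each prototype becomes its cluster mean
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)
```

Swapping the data-point prototype for a hyperplane or hypersphere yields the other algorithm families the abstract mentions.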

Journal ArticleDOI
TL;DR: An advanced review of regression tree methods for mining data streams summarizes the performance results of the reviewed methods and crystallizes 10 requirements for successful implementation of a regression tree algorithm in the data stream mining area.
Abstract: This paper presents an advanced review of regression tree methods for mining data streams. Batch regression tree methods are known for their simplicity, interpretability, accuracy, and efficiency. They use fast divide-and-conquer greedy algorithms that recursively partition the given training data into smaller subsets. The result is a tree-shaped model with splitting rules in the internal nodes and predictions in the leaves. Most batch regression tree methods take a complete dataset and build a model using that data. Generally, this tree model cannot be modified if new data is acquired later. Their successors, the incremental model and interval trees algorithms, are able to build and retrain a model on a step-by-step basis by incorporating new numerical training instances into the model as they become available. Moreover, these algorithms produce even more compact and accurate models than batch regression tree algorithms because they use intervals or functional models with a change detection mechanism, which makes them a more suitable choice for regression analysis of data streams. Finally, this review summarizes the performance results of the reviewed methods and crystallizes 10 requirements for successful implementation of a regression tree algorithm in the data stream mining area. © 2011 Wiley Periodicals, Inc.
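Incremental methods of the kind reviewed avoid storing raw data by maintaining sufficient statistics per leaf. Welford's online mean/variance update is a typical building block for evaluating split quality on a stream (a generic sketch, not one of the reviewed algorithms):

```python
class OnlineStats:
    """Welford's single-pass mean/variance: the per-leaf sufficient
    statistic an incremental regression tree can use to score splits
    without revisiting past instances."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n          # running mean
        self.m2 += delta * (x - self.mean)   # running sum of squared deviations

    @property
    def variance(self):
        return self.m2 / self.n if self.n else 0.0
```

Keeping one such accumulator per candidate split lets the tree compare variance reductions in O(1) memory per statistic as instances stream by.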

Journal ArticleDOI
TL;DR: This focus article considers mining approaches concerning social media in social networks and organizations and the analysis of such data, and describes the VIKAMINE system for mining communities and subgroups in social media in the sketched application domains.
Abstract: Social media is the key component of social networks and organizational social applications. The emergence of new systems and services has created a number of novel social and ubiquitous environments for mining information, data, and, finally, knowledge. This connects but also transcends private and business applications featuring a range of different types of networks and organizational contexts. Important structures concern subgroups emerging in those applications as communities (connecting people), roles and key actors in the networks and communities, and opinions, beliefs, and sentiments of the set of actors. Collective intelligence can then be considered as an emerging phenomenon of the different interactions. This focus article considers mining approaches concerning social media in social networks and organizations and the analysis of such data. We first summarize important terms and concepts. Next, we describe and discuss key actor identification and characterization, sentiment mining and analysis, and community mining. In the sequel we consider different application areas and briefly discuss two exemplary ubiquitous and social applications—the social conference guidance system Conferator, and the MyGroup system for supporting working groups. Furthermore, we describe the VIKAMINE system for mining communities and subgroups in social media in the sketched application domains. Finally, we conclude with a discussion and outlook. © 2012 Wiley Periodicals, Inc.

Journal ArticleDOI
TL;DR: This paper highlights from a data mining perspective the implementation of NN, using supervised and unsupervised learning, for pattern recognition, classification, prediction, and cluster analysis, and focuses the discussion on their usage in bioinformatics and financial data analysis tasks.
Abstract: In recent years, the area of data mining has been experiencing considerable demand for technologies that extract knowledge from large and complex data sources. There has been substantial commercial interest as well as active research in the area that aims to develop new and improved approaches for extracting information, relationships, and patterns from large datasets. Artificial neural networks (NNs) are popular biologically inspired intelligent methodologies, whose classification, prediction, and pattern recognition capabilities have been utilized successfully in many areas, including science, engineering, medicine, business, banking, telecommunication, and many other fields. This paper highlights from a data mining perspective the implementation of NNs, using supervised and unsupervised learning, for pattern recognition, classification, prediction, and cluster analysis, and focuses the discussion on their usage in bioinformatics and financial data analysis tasks. © 2012 Wiley Periodicals, Inc.
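The supervised-learning side of NNs can be illustrated with the smallest possible network: a single perceptron trained with the classic error-correction rule (a textbook sketch, unrelated to any specific application in the paper):

```python
def train_perceptron(data, epochs=100, lr=0.1):
    """Train a single perceptron on (inputs, 0/1 target) pairs and
    return the learned predict function."""
    n = len(data[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, target in data:
            out = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = target - out                   # 0 when correctly classified
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
```

For linearly separable data (such as the AND function) the perceptron convergence theorem guarantees this loop settles on a correct separator.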

Journal ArticleDOI
TL;DR: Different methods for assessing the unexpectedness of patterns are surveyed, with a special focus on frequent itemsets, tiles, association rules, and classification rules, grouped into two major approaches: syntactical and probabilistic.
Abstract: Knowledge discovery methods often discover a large number of patterns. Although this can be considered of interest, it certainly presents considerable challenges too. Indeed, this set of patterns often contains lots of uninteresting patterns that risk overwhelming the data miner. In addition, a single interesting pattern can be discovered in a multitude of tiny variations that for all practical purposes are redundant. These issues are referred to as the pattern explosion problem. They lie at the basis of much recent research attempting to quantify interestingness and redundancy between patterns, with the purpose of filtering down a large pattern set to an interesting and compact subset. Many diverse approaches to interestingness and corresponding interestingness measures (IMs) have been proposed in the literature. Some of them, named objective IMs, define interestingness only based on objective criteria of the pattern and data at hand. Subjective IMs additionally depend on the user's prior knowledge about the dataset. Formalizing unexpectedness is probably the most common approach for defining subjective IMs, where a pattern is deemed unexpected if it contradicts the user's expectations about the dataset. Such subjective IMs based on unexpectedness form the focus of this paper. We categorize measures based on unexpectedness into two major subgroups, namely, syntactical and probabilistic approaches. Based on this distinction, we survey different methods for assessing the unexpectedness of patterns with a special focus on frequent itemsets, tiles, association rules, and classification rules. © 2012 Wiley Periodicals, Inc.
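For contrast with the subjective measures surveyed, an objective interestingness measure such as lift scores a rule A → B by how much more often A and B co-occur than independence would predict. A minimal sketch:

```python
def lift(transactions, antecedent, consequent):
    """Lift of the rule antecedent -> consequent over a list of
    transactions (sets of items); lift > 1 means the two sides
    co-occur more often than expected under independence."""
    n = len(transactions)
    supp = lambda s: sum(s <= t for t in transactions) / n
    return supp(antecedent | consequent) / (supp(antecedent) * supp(consequent))
```

Objective measures like this need only the data; the subjective IMs that are the paper's focus additionally model what the user already expects.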

Journal ArticleDOI
TL;DR: Two techniques, one with fixed number of clusters and another with a variable number of fuzzy clusters, are described along with some experimental results on numerical as well as image data sets.
Abstract: Clustering has been an area of intensive research for several decades because of its multifaceted applications in innumerable domains. Clustering can be either Boolean, where a single data point belongs to exactly one cluster, or fuzzy, where a single data point can have nonzero belongingness to more than one cluster. Traditionally, optimization of some well-defined objective function has been the standard approach in both clustering and fuzzy clustering. Hence, researchers have investigated the utility of evolutionary computing and related techniques in this regard. The different approaches differ in their choice of the objective function and/or the optimization strategy used. In particular, clustering using genetic algorithms (GAs) has attracted attention of researchers, and has been studied extensively. This paper presents a short review of some of different approaches of GA-based clustering methods. Two techniques, one with fixed number of clusters and another with a variable number of fuzzy clusters, are described along with some experimental results on numerical as well as image data sets. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011 1 524–531 DOI: 10.1002/widm.47
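A fixed-number-of-clusters GA in the spirit described above can encode the k cluster centers as a chromosome and use the within-cluster squared error as (negative) fitness. A toy 1-D sketch of our own construction; the surveyed methods differ in encoding, operators, and objective:

```python
import random

def ga_cluster(points, k, pop_size=20, gens=120, seed=0):
    """Tiny elitist GA: a chromosome is a list of k 1-D cluster
    centers; fitness is the negative within-cluster squared error."""
    rng = random.Random(seed)
    def sse(centers):
        return sum(min((p - c) ** 2 for c in centers) for p in points)
    lo, hi = min(points), max(points)
    pop = [[rng.uniform(lo, hi) for _ in range(k)] for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=sse)                    # elitist selection
        elite = pop[:pop_size // 2]
        children = []
        for _ in range(pop_size - len(elite)):
            a, b = rng.sample(elite, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            if rng.random() < 0.5:                            # gaussian mutation
                i = rng.randrange(k)
                child[i] += rng.gauss(0, 0.5)
            children.append(child)
        pop = elite + children
    return sorted(min(pop, key=sse))
```

Unlike Lloyd-style iteration, the GA explores many candidate partitions at once, which is the motivation for evolutionary clustering in the first place.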

Journal ArticleDOI
TL;DR: The authors propose ontology-based data warehousing and data mining technologies in which conceptualization and contextualization of multiple data dimensions are modeled (elements and processes of the petroleum system), alongside their integration (within a data warehouse environment) and data mining of the interpretable emerging petroleum digital ecosystem.
Abstract: A petroleum-bearing sedimentary basin is an emerging digital ecosystem. The petroleum system and its elements are described for every oil and gas field in every petroleum-bearing sedimentary basin. A new concept of ecosystem and its digitization is emerging within the generic petroleum system. The significance of this concept lies in the ability to establish connectivity among petroleum subsystems, for example, the Albertine Graben (Western Uganda), through the attributes of their components and their contextualization and specifications. Unless the integration of elements and processes among subbasins is understood in the context of digital representation, visualization, and interpretation, the existence and survival of the petroleum system (at both local and global scales) and the phenomenon of ‘interconnectivity’ among ecosystems cannot be well explained, and its value in terms of petroleum accumulations and volumes cannot be added. The authors propose ontology-based data warehousing and data mining technologies in which conceptualization and contextualization of multiple data dimensions are modeled (elements and processes of the petroleum system), alongside their integration (within a data warehouse environment) and data mining of the interpretable emerging petroleum digital ecosystem. Multidimensional data warehousing and mining facilitate an effective interpretation of petroleum ecosystems, minimizing the ambiguities involved in structure and reservoir qualifications and quantifications, especially during reserve computations. © 2012 Wiley Periodicals, Inc.

Journal ArticleDOI
TL;DR: This work focuses on the problem of predicting physically interacting protein pairs, and categorizes those methods into three types: pairwise data‐based, network‐ based, and integrative approaches, each approach being described in a different section.
Abstract: Proteins are important cellular molecules, and interacting protein pairs provide biologically important information, such as functional relationships. We focus on the problem of predicting physically interacting protein pairs. This is an important problem in biology, which has been actively investigated in the field of data mining and knowledge discovery. Our particular focus is on data-mining-based methods, and the objective of this review is to introduce these methods to data mining researchers from technical viewpoints. We categorize those methods into three types: pairwise data-based, network-based, and integrative approaches, each approach being described in a different section. The first section is further divided into five types, such as supervised learning, algorithmic approaches, and unsupervised learning. The second section is mainly on link prediction, which can be further divided into two types, and two subsections that cover topics related to protein interaction networks are further added. The final section provides a wide variety of methods in integrative approaches. © 2012 Wiley Periodicals, Inc.
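The simplest network-based link prediction score counts the common interaction partners of a non-adjacent protein pair. A sketch with hypothetical protein names:

```python
from itertools import combinations

def common_neighbors(edges):
    """Score every non-adjacent node pair in an undirected interaction
    network by its number of shared neighbors (common-neighbors score)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    scores = {}
    for u, v in combinations(sorted(adj), 2):
        if v not in adj[u]:                       # only predict missing links
            scores[(u, v)] = len(adj[u] & adj[v])
    return scores
```

The more refined scores used in the literature (e.g., degree-normalized variants) follow the same pattern of ranking candidate interactions by local topology.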

Journal ArticleDOI
TL;DR: The description of different evolutionary DT induction approaches is presented, and multiple examples of evolutionary DT applications in the medical domain are presented.
Abstract: Decision trees (DTs) are a type of data classifier. A typical classifier works in two phases. In the first, the learning phase, the classifier is built according to a preexisting data (training) set. Because decision trees are induced from a known training set and the labels on each example are known, the first step can also be referred to as supervised learning. The second step is when the induced classifier is used for classification. Usually, prior to the first step, several steps should be performed to improve the accuracy and efficiency of the classification: data cleaning, redundancy elimination, and data normalization. Classifiers are evaluated for accuracy, speed, robustness, scalability, and interpretability. DTs are widely used for exploratory knowledge discovery where comprehensible knowledge representation is preferred. The main attraction of DTs lies in their intuitive representation that is easy to understand and comprehend. Accuracy, however, is dependent on the learning data. One of the methods to improve the induction and other phases in the creation of a classifier is the use of evolutionary algorithms. They are used because the classic deterministic approach is not necessarily optimal with regard to the quality, accuracy, and complexity of the obtained classifier. In addition to describing different evolutionary DT induction approaches, this paper also presents multiple examples of evolutionary DT applications in the medical domain. © 2012 Wiley Periodicals, Inc.

Journal ArticleDOI
TL;DR: An introduction to the basic biological background and the general idea of GRN inferences is provided and different methods to infer GRNs are surveyed from two aspects: models that these methods are based on and inference algorithms that those methods use.
Abstract: Reverse engineering of gene regulatory networks (GRNs) is one of the most challenging tasks in systems biology and bioinformatics. It aims at revealing network topologies and regulation relationships between components from biological data. Owing to the development of biotechnologies, various types of biological data are collected from experiments. With the availability of these data, many methods have been developed to infer GRNs. This paper first provides an introduction to the basic biological background and the general idea of GRN inference. Then, different methods are surveyed from two aspects: the models that those methods are based on and the inference algorithms that those methods use. The advantages and disadvantages of these models and algorithms are discussed. © 2012 Wiley Periodicals, Inc.

Journal ArticleDOI
TL;DR: The goal is to introduce this important problem domain to data mining researchers by identifying the key issues and challenges inherent to the area as well as provide directions for fruitful future research.
Abstract: Interactions between deoxyribonucleic acid (DNA) and proteins are central to living systems, and characterizing how and when they occur would greatly enhance our understanding of working genomes. We review the computational problems associated with protein–DNA interactions and the various methods used to solve them. A wide range of topics is covered, including physics-based models for direct and indirect recognition, identification of transcription-factor-binding sites, and methods to predict DNA-binding proteins. Our goal is to introduce this important problem domain to data mining researchers by identifying the key issues and challenges inherent to the area as well as provide directions for fruitful future research. © 2011 Wiley Periodicals, Inc.
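Identification of transcription-factor-binding sites, one of the topics listed, is classically done by sliding a position weight matrix (PWM) along the sequence and reporting high-scoring windows. A minimal sketch with made-up log-odds scores:

```python
def pwm_score(pwm, window):
    # Additive score of a DNA window under a position weight matrix
    return sum(pwm[i][base] for i, base in enumerate(window))

def best_site(pwm, dna):
    # Slide the matrix along the sequence; return the best-scoring offset
    w = len(pwm)
    return max(range(len(dna) - w + 1),
               key=lambda i: pwm_score(pwm, dna[i:i + w]))
```

Real PWMs are estimated from aligned binding-site examples as log-odds against a background model; the scanning logic is the same.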

Journal ArticleDOI
Issam El Naqa1
TL;DR: The role that data mining approaches, particularly machine learning methods, can play to improve the understanding of complex systems such as tumor response to radiotherapy in non-small cell lung cancer is highlighted.
Abstract: Among cancer victims, lung cancer accounts for most fatalities in men and women. Patients at advanced stages of lung cancer suffer from poor survival rates. The majority of these patients are not candidates for surgery and receive radiation therapy (radiotherapy) as their main course of treatment. Despite the effectiveness of radiotherapy against many cancers, more than half of these patients are unfortunately expected to fail. Recent advances in biotechnology have allowed for an unprecedented ability to investigate the role of gene regulation in lung cancer development and progression. However, limited studies have provided insight into lung tumor response to radiotherapy. The inherent complexity and heterogeneity of biological response to radiation therapy may explain the inability of existing prediction models to achieve the necessary sensitivity and specificity for clinical practice or trial design. In this study, we briefly review the current knowledge of genetic and signaling pathways in modulating tumor response to radiotherapy in non-small cell lung cancer as a case study of data mining application in the challenging cancer treatment problem. We highlight the role that data mining approaches, particularly machine learning methods, can play to improve our understanding of complex systems such as tumor response to radiotherapy. This can potentially result in the identification of new prognostic biomarkers or molecular targets to improve response to treatment, leading to better personalization of patients' treatment planning by reducing the risk of complications or supporting more intensive therapy for those patients likely to benefit. © 2012 Wiley Periodicals, Inc. (This work was supported in part by CIHR-MOP-114910 and Fast Foundation grants.)

Journal ArticleDOI
TL;DR: Different DM approaches to oil‐immersed power transformer maintenance are reviewed, including expert systems, fuzzy logic, neural networks, classification and decision, and hybrid intelligent‐based diagnostic systems that apply the DGA database.
Abstract: Knowledge discovery in databases and data mining (DM) have emerged as high-profile, rapidly evolving, urgently needed, and highly practical approaches to using dissolved gas analysis (DGA) data to monitor conditions and faults in oil-immersed power transformers. This study reviews different DM approaches to oil-immersed power transformer maintenance by discussing historical developments and presenting state-of-the-art DM methods. Relevant publications covering a broad range of artificial intelligence methods are reviewed, and current DM approaches for oil-immersed power transformers are discussed, including expert systems, fuzzy logic, neural networks, classification and decision methods, and hybrid intelligent diagnostic systems that apply the DGA database. © 2012 Wiley Periodicals, Inc.

Journal ArticleDOI
TL;DR: In the past decades, very promising approaches to enrich and extend classic static clustering algorithms with dynamic derivatives have been suggested; selected ones are introduced in this review.
Abstract: Clustering methods are one of the most popular approaches to data mining. They have been successfully used in virtually any field, covering domains such as economics, marketing, bioinformatics, engineering, and many others. The classic cluster algorithms require static data structures. However, there is an increasing need to address changing data patterns. On the one hand, this need is generated by the rapidly growing amount of data that is collected by modern information systems and that has led to a renewed interest in data mining as a whole. On the other hand, modern economies and markets no longer deal with stable settings but face the challenge of adapting to constantly changing environments. These include seasonal changes but also long-term trends that structurally change whole economies, wipe out companies that cannot adapt to these trends, and create opportunities for entrepreneurs who establish large multinational corporations virtually out of nothing in just a decade or two. Hence, it is essential for almost any organization to address these changes. Obviously, players that have information on changes first may obtain a strategic advantage over their competitors. This has motivated an increasing number of researchers to enrich and extend classic static clustering algorithms with dynamic derivatives. In the past decades, very promising approaches have been suggested; some selected ones are introduced in this review. © 2012 Wiley Periodicals, Inc.

Journal ArticleDOI
TL;DR: The objective of this article is to provide an overview of the research on biological networks to a general audience, who have some knowledge of biology and statistics, but are not necessarily familiar with this research field.
Abstract: Understanding how the functioning of a biological system emerges from the interactions among its components is a long-standing goal of network science. Fomented by developments in high-throughput technologies to characterize biomolecules and their interactions, network science has emerged as one of the fastest growing areas in computational and systems biology research. Although the number of research and review articles on different aspects of network science is increasing, updated resources that provide a broad, yet concise, review of this area in the context of systems biology are few. The objective of this article is to provide an overview of the research on biological networks to a general audience that has some knowledge of biology and statistics but is not necessarily familiar with this research field. Based on the different aspects of network science research, the article is broadly divided into four sections: (1) network construction, (2) topological analysis, (3) network and data integration, and (4) visualization tools. We specifically focus on the most widely studied types of biological networks, which are metabolic, gene regulatory, protein–protein interaction, genetic interaction, and signaling networks. In the future, with further developments in experimental and computational methods, we expect that the analysis of biological networks will assume a leading role in basic and translational research. © 2012 Wiley Periodicals, Inc.

Journal ArticleDOI
TL;DR: An objective function-based clustering algorithm tries to minimize (or maximize) a function such that the clusters obtained when the minimum/maximum is reached are homogeneous.
Abstract: Clustering is typically applied for data exploration when there are no or very few labeled data available. The goal is to find groups, or clusters, of like data that will be of interest to the person applying the algorithm. An objective function-based clustering algorithm tries to minimize (or maximize) a function such that the clusters obtained when the minimum/maximum is reached are homogeneous. One needs to choose a good set of features and the appropriate number of clusters to generate a good partition of the data into maximally homogeneous groups. Objective functions for clustering are introduced, and clustering algorithms generated from the given objective functions are shown, with a number of widely used approaches discussed. © 2012 Wiley Periodicals, Inc.
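The alternating minimization the abstract describes can be sketched with k-means, the most widely used objective-function-based clustering algorithm, whose objective is the within-cluster sum of squared distances. The sketch below is illustrative only and not from the article; the function names and toy data are hypothetical.

```python
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Minimal k-means: alternating steps that decrease the
    within-cluster sum of squared distances (the objective)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest center.
        labels = [min(range(k), key=lambda j: dist2(p, centers[j]))
                  for p in points]
        # Update step: move each center to the mean of its points.
        new_centers = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            new_centers.append(center_of(members) if members else centers[j])
        if new_centers == centers:
            break  # the objective can no longer decrease
        centers = new_centers
    objective = sum(dist2(p, centers[l]) for p, l in zip(points, labels))
    return labels, centers, objective

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def center_of(pts):
    return tuple(sum(c) / len(pts) for c in zip(*pts))
```

Each iteration can only lower (or keep) the objective, so the procedure converges to a locally optimal, maximally homogeneous partition for the chosen k.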

Journal ArticleDOI
TL;DR: FiRePat—Finding Regulatory Patterns—an unsupervised data mining tool applicable to large datasets, typically produced by high-throughput sequencing of sRNAs and mRNAs or microarray experiments, that detects sRNA-gene pairs with correlated expression levels is presented.
Abstract: Small RNAs are regulatory RNA fragments which, through RNA silencing, can regulate the expression of genes. Because sRNAs are negative regulators, it is generally assumed that expression profiles of sRNAs and their targets are negatively correlated. Recently, examples of positive correlation between the expression of sRNAs and their targets have been discovered. At the moment, it is not known how many sRNA-target pairs are positively and negatively correlated, and it is also not clear in what situations (e.g., under which treatments) any of these correlations can be observed. To determine this, one of the first steps is to develop tools to carry out a genome-wide characterization of covariation of expression levels of sRNAs and genes. We present FiRePat—Finding Regulatory Patterns—an unsupervised data mining tool applicable to large datasets, typically produced by high-throughput sequencing of sRNAs and mRNAs or microarray experiments, that detects sRNA-gene pairs with correlated expression levels. The method consists of three steps: first, we select differentially expressed sRNAs and genes; second, we compute the correlation between sRNA and gene series for all possible sRNA–gene pairs; and third, we cluster the sRNA or gene expression series, simultaneously inducing clusters in the other series. Potential uses of FiRePat are presented using publicly available sRNA and mRNA datasets for both plants and animals. The standard output of FiRePat, a list of correlated pairs formed with sRNAs and mRNAs, can be used to investigate the cause and consequences of the respective expression patterns. © 2012 Wiley Periodicals, Inc.
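The second step of the pipeline, scoring every possible sRNA–gene pair by the correlation of their expression series and keeping strongly positively or negatively correlated pairs, can be sketched as follows. This is not the FiRePat implementation; the function names, threshold, and toy expression series are hypothetical.

```python
from itertools import product
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length expression series."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def correlated_pairs(srnas, genes, threshold=0.9):
    """Score all sRNA-gene pairs; keep those whose expression is
    strongly positively OR negatively correlated."""
    hits = []
    for (s, sx), (g, gx) in product(srnas.items(), genes.items()):
        r = pearson(sx, gx)
        if abs(r) >= threshold:
            hits.append((s, g, round(r, 3)))
    return hits
```

Keeping both signs of correlation matters here, since the abstract notes that positively as well as negatively correlated sRNA-target pairs occur.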

Journal ArticleDOI
TL;DR: This work discusses numerous real‐world challenges in building accurate fault prediction models and presents some solutions to these challenges.
Abstract: Software repositories such as source control repositories, bug repositories, deployment logs, and code repositories provide useful patterns for practitioners. Instead of using these repositories merely for record keeping, we need to transform them into active repositories that can guide decision processes inside the company. By mining software repositories (MSR) with several data mining algorithms, effective software fault prediction models can be built and error-prone modules can be detected prior to the testing phase. We discuss numerous real-world challenges in building accurate fault prediction models and present some solutions to these challenges. © 2012 Wiley Periodicals, Inc.
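As one illustration of how mined repository data can feed a fault prediction model, the sketch below labels a new module by the majority vote of its k most similar historical modules (a simple nearest-neighbour classifier). The metric choice (e.g., lines of code and complexity), function names, and data are hypothetical; the article does not prescribe a specific algorithm.

```python
def predict_fault_prone(history, new_module, k=3):
    """history: list of (metric_vector, label) pairs mined from past
    releases, where label is 1 if the module turned out faulty.
    Returns True if the majority of the k nearest neighbours were faulty."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = sorted(history, key=lambda m: dist(m[0], new_module))[:k]
    faulty_votes = sum(label for _, label in nearest)
    return faulty_votes * 2 > k
```

Flagging such modules before the testing phase lets scarce verification effort be focused where faults are most likely.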

Journal ArticleDOI
TL;DR: This article introduces and discusses the major categories of sensitive-knowledge-protecting methodologies; the aim of these algorithms is to extract as much nonsensitive knowledge as possible from the collaborative databases while protecting sensitive information.
Abstract: Mining association rules from huge amounts of data is an important issue in data mining, with the discovered information often being commercially valuable. Moreover, companies that conduct similar business are often willing to collaborate with each other by mining significant knowledge patterns from the collaborative datasets to gain mutual benefit. However, in a cooperative project, some of these companies may want certain strategic or private data, called sensitive patterns, not to be published in the database. Therefore, before the database is released for sharing, some sensitive patterns have to be hidden in the database because of privacy or security concerns. To solve this problem, the sensitive-knowledge-hiding (association rule hiding) problem has been discussed in the research community working on security and knowledge discovery. The aim of these algorithms is to extract as much nonsensitive knowledge as possible from the collaborative databases while protecting sensitive information. The sensitive-knowledge-hiding problem was proven to be NP-hard, and since then a great deal of research has been conducted to solve it. In this article, we introduce and discuss the major categories of sensitive-knowledge-protecting methodologies. © 2011 Wiley Periodicals, Inc.

Journal ArticleDOI
TL;DR: The state‐of‐the‐art statistical scoring approaches used in the prediction of drug–target interactions are reviewed, and their operation is illustrated using publicly available data from yeast chemical‐genomic profiling studies.
Abstract: The recent decrease in the rate at which new cancer therapies are translated into clinical use is mainly due to the lack of therapeutic efficacy and clinical safety or toxicology of the candidate drug compounds. An important prerequisite for the development of safe and effective chemical compounds is the identification of their cellular targets. High-throughput screening is increasingly being used to test new drug compounds and to infer their cellular targets, but these quantitative screens result in high-dimensional datasets with many inherent sources of noise. We review here the state-of-the-art statistical scoring approaches used in the prediction of drug–target interactions, and illustrate their operation using publicly available data from yeast chemical-genomic profiling studies. The real data examples underscore the need to develop more advanced data mining approaches for extracting the full information from high-throughput screens. A particular medical application stems from the concept of synthetic lethality in cancer and how it could open up new opportunities for personalized cancer therapies. © 2012 Wiley Periodicals, Inc.
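A minimal version of one such statistical scoring approach: z-score each deletion strain's growth under a compound against control screens, and flag strains with strongly reduced fitness as candidate target pathways. This is an illustrative sketch with hypothetical gene names and cutoff, not the specific scoring methods reviewed in the article.

```python
from statistics import mean, stdev

def z_scores(profile, controls):
    """profile: gene -> growth measurement under the compound.
    controls: gene -> list of measurements from control screens.
    A large negative z means the strain grows unusually poorly under
    the drug, suggesting the deleted gene is needed in its presence."""
    scores = {}
    for gene, value in profile.items():
        mu = mean(controls[gene])
        sd = stdev(controls[gene])
        scores[gene] = (value - mu) / sd
    return scores

def top_hits(scores, cutoff=-2.0):
    """Genes whose fitness defect is significant at the chosen cutoff."""
    return sorted(g for g, z in scores.items() if z <= cutoff)
```

In practice the noise sources the abstract mentions (batch effects, plate position, replicate variability) motivate more robust estimators than a plain mean and standard deviation, which is exactly the gap the review points at.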